Observability Guide

Loki-VL-proxy emits the same core operational signals in two forms:

| Signal | Transport | Format | Best use |
| --- | --- | --- | --- |
| Metrics | Pull | Prometheus text at /metrics | Prometheus, Grafana Agent, Alloy, VictoriaMetrics, kube scraping |
| Metrics | Push | OTLP HTTP JSON to /v1/metrics | OpenTelemetry Collector, vendor OTLP gateways |
| Logs | Stream | Structured JSON on stdout/stderr | Fluent Bit, Vector, OpenTelemetry Collector, Docker/Kubernetes log agents |

The intent is parity, not two separate products. Prometheus scrape and OTLP push carry the same proxy-centric metric families, units, and low-cardinality request dimensions for the important operational paths. Prometheus uses label keys such as system, direction, endpoint, route, and status; OTLP exports the same dimensions as semantically aligned attributes such as loki.api.system, proxy.direction, loki.request.type, http.route, and http.response.status_code.

That shared model is what makes the packaged dashboard portable across scrape-backed and OTLP-backed setups without rewriting the operator view:

  • request volume and latency
  • backend latency
  • cache hit/miss behavior
  • translation failures
  • tenant and client hot spots
  • client status and query-length outliers
  • process/runtime and host-level health

The proxy also deliberately keeps the more expensive metadata paths warmer than the live log query paths:

  • live query and tail paths stay on short TTLs so fresh log visibility is not stretched unnecessarily
  • slower-changing metadata such as labels, field lists, field values, and patterns are cached more aggressively
  • Drilldown prefers backend-native metadata discovery where it is safe, which reduces proxy-side rescans and lowers CPU pressure on repeated field/label browsing

Observability Endpoints

| Endpoint | Purpose |
| --- | --- |
| GET /ready | Readiness probe (checks backend /health and circuit-breaker state) |
| GET /metrics | Prometheus text exposition (-server.register-instrumentation, bounded by -server.metrics-max-concurrency) |
| GET /debug/queries | Query analytics endpoint (disabled by default, -server.enable-query-analytics) |
| GET /debug/pprof/ | Go pprof profiling endpoints (disabled by default, -server.enable-pprof) |
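
A minimal Kubernetes wiring sketch for /ready; the port and timing values here are assumptions (3100 matches the scrape examples later in this guide), not chart defaults:

readinessProbe:
  httpGet:
    path: /ready       # backed by backend /health plus circuit-breaker state
    port: 3100         # assumed listen port; match your container port
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3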

Logs

JSON Log Shape

Default logs are emitted as JSON and already use OTel-friendly top-level keys:

{
  "timestamp": "2026-04-05T18:03:27.214918Z",
  "severity": {
    "text": "INFO",
    "number": 9
  },
  "body": "request",
  "component": "proxy",
  "http.route": "/loki/api/v1/query_range",
  "url.path": "/loki/api/v1/query_range",
  "http.request.method": "GET",
  "http.response.status_code": 200,
  "loki.request.type": "query_range",
  "loki.api.system": "loki",
  "proxy.direction": "downstream",
  "event.duration": 42000000,
  "loki.tenant.id": "team-a",
  "loki.query": "{service_name=\"api\"} |= \"error\"",
  "client.address": "10.0.0.12:51884",
  "enduser.id": "grafana-user@example.com",
  "enduser.source": "grafana_user",
  "cache.result": "miss",
  "proxy.duration_ms": 42,
  "upstream.calls": 1,
  "upstream.calls_by_type": {
    "vl:select_logsql_query": 1
  },
  "upstream.status_code": 200,
  "upstream.duration_ms": 31,
  "upstream.duration_ms_by_type": {
    "vl:select_logsql_query": 31
  },
  "proxy.operations_by_type": {
    "translate_query:translated": 1
  },
  "proxy.operation_duration_ms_by_type": {
    "translate_query:translated": 4
  },
  "proxy.overhead_ms": 11
}

That makes the log stream usable in two ways:

  • plain JSON ingestion with no transformation
  • low-friction mapping into the OpenTelemetry log data model

Log Sources

The proxy writes structured logs for:

  • request lifecycle and status
  • query translation and backend request flow
  • tail/WebSocket behavior
  • delete audit events
  • cache warmer and disk cache internals
  • OTLP export failures

OpenTelemetry Fields Used in Logs

| Field | Meaning |
| --- | --- |
| timestamp | event time |
| severity.text / severity.number | log severity |
| body | message body |
| component | internal subsystem (proxy, disk_cache, cache_warmer, otlp_metrics) |
| http.* / url.path | request semantics and normalized route vs actual request path |
| http.parent_route | parent downstream route template on upstream child-call logs |
| event.duration | request or upstream call duration in nanoseconds |
| client.address | remote address |
| enduser.id | stable trusted user/client identity when available |
| enduser.name | display/login user name from trusted user headers when available |
| enduser.source | trusted header source for end-user attribution (grafana_user, forwarded_user, etc.) |
| auth.* | datasource/auth principal context (separate from enduser.id) |
| cache.result | compatibility cache result (hit, miss, bypass) |
| proxy.* | proxy-facing convenience fields such as total request duration and measured proxy overhead |
| upstream.* | backend call count, status, and latency |
| loki.* | Loki/proxy-specific attributes |

Additional request-scope aggregate fields used for fanout visibility:

| Field | Meaning |
| --- | --- |
| loki.parent_request.type | parent downstream request type on upstream child-call logs |
| upstream.calls_by_type | per-parent aggregate map keyed by <system>:<request_type> |
| upstream.duration_ms_by_type | per-parent aggregate latency map keyed by <system>:<request_type> |
| proxy.operations_by_type | per-parent aggregate map keyed by <operation>:<outcome> for proxy-only work |
| proxy.operation_duration_ms_by_type | per-parent aggregate latency map keyed by <operation>:<outcome> |

These aggregate map keys are intentionally bounded by route templates and hardcoded operation/outcome enums. They are log fields, not metric labels.

Metrics

Export Modes

Prometheus Scrape

scrape_configs:
  - job_name: loki-vl-proxy
    scrape_interval: 15s
    static_configs:
      - targets:
          - loki-vl-proxy:3100
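
If you run the Prometheus Operator instead of a static scrape config, the equivalent object looks roughly like this sketch; the selector labels and port name are assumptions to adapt to your deployment (the Helm chart's serviceMonitor.enabled=true renders one for you):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki-vl-proxy
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: loki-vl-proxy   # assumed Service label
  endpoints:
    - port: http        # assumed Service port name exposing 3100
      path: /metrics
      interval: 15s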

OTLP Push

./loki-vl-proxy \
  -backend=http://victorialogs:9428 \
  -otlp-endpoint=http://otel-collector:4318/v1/metrics \
  -otlp-interval=30s \
  -otlp-compression=gzip \
  -otlp-headers='Authorization=Bearer example-token'

If the OTLP endpoint is passed as a collector base URL like http://collector:4318 or http://collector:4318/v1, the proxy normalizes it to /v1/metrics.
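
On the receiving side, a minimal OpenTelemetry Collector sketch that accepts this push on the standard OTLP/HTTP port; the debug exporter stands in for your real pipeline:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318   # serves /v1/metrics for the proxy's OTLP push

exporters:
  debug: {}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]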

OpenTelemetry Resource Attributes for Metrics and Logs

These flags shape OTLP metric resource attributes. Structured logs intentionally do not duplicate resource attributes per line; keep service identity in collector/OTLP resource metadata to avoid message.service.* duplication in storage.

| Flag | Meaning |
| --- | --- |
| -otel-service-name | service.name |
| -otel-service-namespace | service.namespace |
| -otel-service-instance-id | service.instance.id |
| -deployment-environment | deployment.environment.name |
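
A startup sketch combining these identity flags with OTLP push; all values are illustrative:

./loki-vl-proxy \
  -backend=http://victorialogs:9428 \
  -otlp-endpoint=http://otel-collector:4318 \
  -otel-service-name=loki-vl-proxy \
  -otel-service-namespace=observability \
  -otel-service-instance-id="$(hostname)" \
  -deployment-environment=production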

Request Dimensions

Request-oriented metrics use stable low-cardinality dimensions so dashboards can slice by user-visible API shape without leaking raw paths or query content.

| Dimension | Prometheus scrape | OTLP push | Example |
| --- | --- | --- | --- |
| API system | system | loki.api.system | loki, vl |
| Direction | direction | proxy.direction | downstream, upstream |
| Request type | endpoint | loki.request.type | query_range, labels, patterns |
| Route template | route | http.route | /loki/api/v1/query_range, /select/logsql/query |
| Final status | status | http.response.status_code | 200, 429, 500 |

Downstream routes are the normalized Loki API templates registered by the proxy. Upstream routes are the stable VictoriaLogs or rules/alerts backend path templates used by the proxy itself. Raw request paths and query strings stay in logs, not in metric labels.

Tenant and client metric families are the only intentionally high-cardinality families, and even those are bounded with -metrics.max-tenants and -metrics.max-clients; excess identities collapse to __overflow__.

Histogram helper series (_bucket, _sum, _count) inherit the same label set and cardinality as the parent metric family.
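
To check that those caps are sized correctly, a hedged PromQL sketch is to watch how much traffic collapses into the overflow identity; the metric and label names are taken verbatim from the tenant and client families below:

# share of tenant traffic collapsing into the capped overflow bucket
sum(rate(loki_vl_proxy_tenant_requests_total{tenant="__overflow__"}[5m]))

# same check for client identities
sum(rate(loki_vl_proxy_client_requests_total{client="__overflow__"}[5m]))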

Cardinality Levels

| Level | Meaning |
| --- | --- |
| Low | no labels or only fixed route templates / small enums (status, direction, mode, reason) |
| Medium | bounded internal enums that may grow slowly with feature surface but not with traffic shape |
| High (capped) | user or tenant identity dimensions; bounded by -metrics.max-tenants / -metrics.max-clients with __overflow__ fallback |

Core Proxy Metrics

All rows below are exposed through Prometheus scrape and OTLP push unless noted otherwise.

| Metric | Type | Labels | Cardinality | Description |
| --- | --- | --- | --- | --- |
| loki_vl_proxy_requests_total | counter | system, direction, endpoint, route, status | Low | all proxied requests, sliced by downstream Loki path or upstream backend path |
| loki_vl_proxy_request_duration_seconds | histogram | system, direction, endpoint, route | Low | end-to-end request latency |
| loki_vl_proxy_backend_duration_seconds | histogram | system, direction, endpoint, route | Low | upstream backend latency only (system="vl", direction="upstream") |
| loki_vl_proxy_upstream_calls_per_request | histogram | system, direction, endpoint, route | Low | number of upstream child requests fanned out under a single downstream request |
| loki_vl_proxy_cache_hits_total | counter | none | Low | global cache hits |
| loki_vl_proxy_cache_misses_total | counter | none | Low | global cache misses |
| loki_vl_proxy_cache_hits_by_endpoint | counter | system, direction, endpoint, route | Low | cache hits per normalized route |
| loki_vl_proxy_cache_misses_by_endpoint | counter | system, direction, endpoint, route | Low | cache misses per normalized route |
| loki_vl_proxy_translations_total | counter | none | Low | successful LogQL to LogsQL translations |
| loki_vl_proxy_translation_errors_total | counter | none | Low | failed translations |
| loki_vl_proxy_internal_operation_total | counter | operation, outcome | Medium | proxy-only work such as translation, parser preference, and response-label rewrites |
| loki_vl_proxy_internal_operation_duration_seconds | histogram | operation, outcome | Medium | latency spent in proxy-only work not covered by backend timings |
| loki_vl_proxy_coalesced_total | counter | none | Low | requests served from coalesced results |
| loki_vl_proxy_coalesced_saved_total | counter | none | Low | backend requests saved by coalescing |
| loki_vl_proxy_response_tuple_mode_total | counter | mode | Low | emitted log tuple contract mode by client behavior (Prometheus scrape only today) |
| loki_vl_proxy_uptime_seconds | gauge | none | Low | process uptime |
| loki_vl_proxy_active_requests | gauge | none | Low | current in-flight requests |
| loki_vl_proxy_circuit_breaker_state | gauge | none | Low | 0=closed, 1=open, 2=half-open |
| loki_vl_proxy_http_connections | gauge | state | Low | current downstream HTTP server connections by state |
| loki_vl_proxy_http_connection_transitions_total | counter | state | Low | downstream HTTP server connection state transitions |
| loki_vl_proxy_http_connection_rotations_total | counter | reason | Low | downstream HTTP/1.x connection rotations triggered by the proxy |

Operational notes for these hot paths:

  • query_range and labels benchmarks in CI track both cache-hit and cache-bypass behavior
  • multi-tenant read fanout and merged response bodies are capped to keep a single request from exhausting proxy memory
  • synthetic tail keeps bounded dedup state so long-running websocket sessions do not grow without limit

Query-Range Windowing Metrics

These are the primary signals for long-range query performance and backend protection:

| Metric | Type | Labels | Cardinality | Description |
| --- | --- | --- | --- | --- |
| loki_vl_proxy_window_cache_hit_total | counter | none | Low | cached split windows served without backend scan |
| loki_vl_proxy_window_cache_miss_total | counter | none | Low | split windows requiring backend scan |
| loki_vl_proxy_window_fetch_seconds | histogram | none | Low | backend fetch duration per split window |
| loki_vl_proxy_window_merge_seconds | histogram | none | Low | merge duration for split-window responses |
| loki_vl_proxy_window_count | histogram | none | Low | split windows per query_range request |
| loki_vl_proxy_window_prefilter_attempt_total | counter | none | Low | prefilter runs against /select/logsql/hits |
| loki_vl_proxy_window_prefilter_error_total | counter | none | Low | prefilter failures (proxy safely falls back to full window fanout) |
| loki_vl_proxy_window_prefilter_kept_total | counter | none | Low | split windows retained for real log fanout |
| loki_vl_proxy_window_prefilter_skipped_total | counter | none | Low | split windows skipped as empty by prefilter |
| loki_vl_proxy_window_prefilter_hit_ratio | gauge | none | Low | current prefilter kept/total ratio (0-1) |
| loki_vl_proxy_window_retry_total | counter | none | Low | per-window retry attempts after retryable backend failures |
| loki_vl_proxy_window_degraded_batch_total | counter | none | Low | batches that were downgraded to lower parallelism |
| loki_vl_proxy_window_partial_response_total | counter | none | Low | partial query-range responses returned when slow windows exceed budget |
| loki_vl_proxy_window_prefilter_duration_seconds | histogram | none | Low | prefilter latency |
| loki_vl_proxy_window_adaptive_parallel_current | gauge | none | Low | current adaptive split-window parallelism |
| loki_vl_proxy_window_adaptive_latency_ewma_seconds | gauge | none | Low | adaptive EWMA latency |
| loki_vl_proxy_window_adaptive_error_ewma | gauge | none | Low | adaptive EWMA backend error ratio |
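
A hedged PromQL sketch for the window cache hit ratio, mirroring the route cache ratio pattern used in the drilldowns later in this guide:

sum(rate(loki_vl_proxy_window_cache_hit_total[5m]))
  /
clamp_min(
  sum(rate(loki_vl_proxy_window_cache_hit_total[5m]))
    + sum(rate(loki_vl_proxy_window_cache_miss_total[5m])),
  1
)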

Patterns Snapshot Metrics

These metrics track the proxy-side pattern cache and snapshot lifecycle.

| Metric | Type | Labels | Cardinality | Description |
| --- | --- | --- | --- | --- |
| loki_vl_proxy_patterns_detected_total | counter | none | Low | unique patterns detected from pattern mining |
| loki_vl_proxy_patterns_stored_total | counter | none | Low | pattern entries stored in proxy cache or snapshot updates |
| loki_vl_proxy_patterns_restored_from_disk_total | counter | none | Low | pattern entries restored from on-disk snapshots |
| loki_vl_proxy_patterns_restored_from_peers_total | counter | none | Low | pattern entries restored from peer snapshots |
| loki_vl_proxy_patterns_restored_disk_entries_total | counter | none | Low | snapshot cache keys restored from disk |
| loki_vl_proxy_patterns_restored_peer_entries_total | counter | none | Low | snapshot cache keys restored from peers |
| loki_vl_proxy_patterns_deduplicated_total | counter | source | Low | duplicate pattern snapshot entries removed by source (mem, disk, peer) |
| loki_vl_proxy_patterns_in_memory | gauge | none | Low | current number of patterns held in in-memory snapshot state |
| loki_vl_proxy_patterns_cache_keys | gauge | none | Low | current number of pattern cache keys held in memory |
| loki_vl_proxy_patterns_in_memory_bytes | gauge | none | Low | current bytes used by in-memory pattern snapshot payloads |
| loki_vl_proxy_patterns_last_response_patterns | gauge | none | Low | pattern entries returned in the most recent /patterns response |
| loki_vl_proxy_patterns_last_response_bytes | gauge | none | Low | encoded size of the most recent /patterns response |
| loki_vl_proxy_patterns_persisted_disk_entries | gauge | none | Low | snapshot cache keys present in the last persisted disk snapshot |
| loki_vl_proxy_patterns_persisted_disk_patterns | gauge | none | Low | pattern entries present in the last persisted disk snapshot |
| loki_vl_proxy_patterns_persisted_disk_bytes | gauge | none | Low | last persisted pattern snapshot size on disk |
| loki_vl_proxy_patterns_persist_writes_total | counter | none | Low | completed pattern snapshot writes to disk |
| loki_vl_proxy_patterns_persist_write_bytes_total | counter | none | Low | cumulative bytes written by pattern snapshot persistence |
| loki_vl_proxy_patterns_restored_disk_bytes_total | counter | none | Low | cumulative bytes restored from on-disk pattern snapshots |
| loki_vl_proxy_patterns_restored_peer_bytes_total | counter | none | Low | cumulative bytes restored from peer snapshot warmup |
| loki_vl_proxy_patterns_source_lines_requested_total | counter | none | Low | source lines requested from backend pattern fetches |
| loki_vl_proxy_patterns_source_lines_scanned_total | counter | none | Low | source lines scanned from backend responses |
| loki_vl_proxy_patterns_source_lines_observed_total | counter | none | Low | source lines accepted into the pattern miner |
| loki_vl_proxy_patterns_windows_attempted_total | counter | none | Low | pattern fetch windows attempted |
| loki_vl_proxy_patterns_windows_accepted_total | counter | none | Low | pattern fetch windows accepted into the merged response |
| loki_vl_proxy_patterns_windows_capped_total | counter | none | Low | pattern fetch windows that hit the per-window source line cap |
| loki_vl_proxy_patterns_second_pass_windows_total | counter | none | Low | pattern fetch windows retried with a higher line limit |
| loki_vl_proxy_patterns_mined_pre_merge_total | counter | none | Low | pattern entries mined before cross-window merge |
| loki_vl_proxy_patterns_mined_post_merge_total | counter | none | Low | pattern entries after cross-window merge |
| loki_vl_proxy_patterns_snapshot_hits_total | counter | none | Low | pattern snapshot fallback lookups that found cached payloads |
| loki_vl_proxy_patterns_snapshot_misses_total | counter | none | Low | pattern snapshot fallback lookups that missed |
| loki_vl_proxy_patterns_snapshot_reused_total | counter | none | Low | cached snapshot payloads actually reused in /patterns responses |
| loki_vl_proxy_patterns_low_coverage_responses_total | counter | none | Low | responses flagged as likely degraded by capped or incomplete mining coverage |

Peer Cache Metrics

These families are currently exposed only via Prometheus scrape at /metrics.

| Metric | Type | Labels | Cardinality | Description |
| --- | --- | --- | --- | --- |
| loki_vl_proxy_peer_cache_peers | gauge | none | Low | remote peers currently in the fleet-cache ring |
| loki_vl_proxy_peer_cache_cluster_members | gauge | none | Low | total fleet-cache ring members including self |
| loki_vl_proxy_peer_cache_hits_total | counter | none | Low | successful peer-cache fetches |
| loki_vl_proxy_peer_cache_misses_total | counter | none | Low | peer-cache lookups that missed on the owner |
| loki_vl_proxy_peer_cache_errors_total | counter | none | Low | peer-cache fetch errors |
| loki_vl_proxy_peer_cache_write_through_pushes_total | counter | none | Low | successful owner write-through pushes from non-owner peers |
| loki_vl_proxy_peer_cache_write_through_errors_total | counter | none | Low | owner write-through push errors |
| loki_vl_proxy_peer_cache_hot_index_requests_total | counter | none | Low | peer hot-index requests |
| loki_vl_proxy_peer_cache_hot_index_errors_total | counter | none | Low | peer hot-index request errors |
| loki_vl_proxy_peer_cache_read_ahead_prefetches_total | counter | none | Low | successful hot read-ahead prefetches |
| loki_vl_proxy_peer_cache_read_ahead_prefetch_bytes_total | counter | none | Low | bytes prefetched by hot read-ahead |
| loki_vl_proxy_peer_cache_read_ahead_budget_drops_total | counter | none | Low | hot read-ahead candidates dropped by budget or size filters |
| loki_vl_proxy_peer_cache_read_ahead_tenant_skips_total | counter | none | Low | hot read-ahead candidates skipped by tenant fairness |

Tenant and Client Metrics

These are the metrics to use when you want to identify the users or tenants actually causing backend load.

| Metric | Type | Labels | Cardinality | Description |
| --- | --- | --- | --- | --- |
| loki_vl_proxy_tenant_requests_total | counter | system, direction, tenant, endpoint, route, status | High (capped) | request volume by tenant |
| loki_vl_proxy_tenant_request_duration_seconds | histogram | system, direction, tenant, endpoint, route | High (capped) | latency by tenant |
| loki_vl_proxy_client_requests_total | counter | system, direction, client, endpoint, route | High (capped) | request volume by client identity |
| loki_vl_proxy_client_response_bytes_total | counter | client | High (capped) | response bytes by client |
| loki_vl_proxy_client_status_total | counter | system, direction, client, endpoint, route, status | High (capped) | final status breakdown by client |
| loki_vl_proxy_client_inflight_requests | gauge | client | High (capped) | current parallelism by client |
| loki_vl_proxy_client_request_duration_seconds | histogram | system, direction, client, endpoint, route | High (capped) | request latency by client |
| loki_vl_proxy_client_query_length_chars | histogram | system, direction, client, endpoint, route | High (capped) | query size outliers by client |
| loki_vl_proxy_client_errors_total | counter | system, direction, endpoint, route, reason | Low | categorized downstream client errors |

This is one of the main advantages of putting an explicit proxy between the Grafana Loki datasource and VictoriaLogs: the read path becomes attributable.

Instead of only seeing aggregate datasource traffic, operators can see:

  • which Grafana user or trusted client identity is generating load
  • which tenant is hot
  • which route is expensive for that client or tenant
  • which client is producing the largest responses, longest queries, or most bad requests
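
For example, hedged topk sketches over the families above surface the heaviest identities directly; metric and label names are verbatim from the table:

# top clients by downstream request rate
topk(5, sum by (client) (rate(loki_vl_proxy_client_requests_total{direction="downstream"}[5m])))

# top clients by response bytes
topk(5, sum by (client) (rate(loki_vl_proxy_client_response_bytes_total[5m])))

# top tenants by downstream request rate
topk(5, sum by (tenant) (rate(loki_vl_proxy_tenant_requests_total{direction="downstream"}[5m])))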

Grafana Client Visibility, Offenders, and User Patterns

When -metrics.trust-proxy-headers=true is enabled behind a trusted Grafana or auth proxy, the proxy can turn northbound identity into durable read-path signals without using raw datasource credentials as the end-user key.

That gives you:

  • per-client request rate by route via loki_vl_proxy_client_requests_total
  • per-client latency by route via loki_vl_proxy_client_request_duration_seconds
  • per-client response-volume visibility via loki_vl_proxy_client_response_bytes_total
  • per-client query-size outlier visibility via loki_vl_proxy_client_query_length_chars
  • per-client bad-request and error clustering via loki_vl_proxy_client_status_total and loki_vl_proxy_client_errors_total
  • per-tenant volume and latency visibility via loki_vl_proxy_tenant_*

Those tenant and client identity series are opt-in. Set -metrics.export-sensitive-labels=true only on trusted scrape or OTLP paths where exposing identity labels is acceptable.

At log level, the same request can also carry:

  • enduser.id
  • enduser.name
  • enduser.source
  • auth.principal
  • auth.source
  • loki.tenant.id
  • http.route

Per-request by-type breakdown maps such as upstream.calls_by_type and proxy.operations_by_type are emitted only at debug level. The default info-level request logs keep aggregate counts while the detailed per-type visibility stays in Prometheus/OTLP metrics. This avoids log-body field explosion in pipelines that flatten structured JSON bodies into discoverable message.* fields.

That separation matters:

  • enduser.* answers "which Grafana user or trusted client triggered this?"
  • auth.* answers "which datasource or auth principal was used on the request path?"
  • loki.tenant.id answers "which tenant boundary did the request execute in?"

This is what makes offender analysis practical on the read path instead of only looking at coarse IP-level traffic.

Northbound and Southbound Auth Boundaries

The same proxy layer also improves trust separation between components.

| Boundary | Main controls | Why it matters operationally |
| --- | --- | --- |
| Grafana or client -> proxy | -auth.enabled, -tls-client-ca-file, -tls-require-client-cert, trusted user headers with -metrics.trust-proxy-headers | Lets the proxy require tenant context, optionally require client certs, and attribute read traffic to the actual Grafana user or trusted upstream identity when sensitive metrics export is explicitly enabled. |
| Proxy -> VictoriaLogs | -backend-basic-auth, -forward-authorization, -forward-headers | Lets the lower layer keep its own auth boundary while the proxy preserves full Loki-client compatibility on the northbound side. |
| Proxy -> peer cache | -peer-auth-token | Prevents peer-cache reuse from becoming an unauthenticated east-west path when the fleet spans a broader network boundary. |
| Operator -> admin/debug endpoints | -server.admin-auth-token | Protects admin and troubleshooting surfaces without weakening the main read path. Non-loopback listeners now require this token before /debug/queries or /debug/pprof can be enabled. |
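
A hedged startup sketch enabling each boundary; the flag names come from the table above, but the exact value syntax (for example the user:password form) is an assumption, so verify against -help before use:

./loki-vl-proxy \
  -backend=http://victorialogs:9428 \
  -auth.enabled=true \
  -tls-client-ca-file=/etc/loki-vl-proxy/client-ca.pem \
  -tls-require-client-cert=true \
  -backend-basic-auth=vl-user:vl-password \
  -peer-auth-token="$PEER_TOKEN" \
  -server.admin-auth-token="$ADMIN_TOKEN"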

When trusted proxy headers are enabled, the proxy also forwards derived context headers to VictoriaLogs:

  • X-Loki-VL-Client-ID
  • X-Loki-VL-Client-Source

That gives the lower layer better context about who is really behind the read traffic while still preserving datasource compatibility at the Grafana edge.

Runtime and Process Metrics

The proxy also exports a lightweight built-in set of runtime and process/container health metrics. App-scoped aliases are emitted with the loki_vl_proxy_ prefix, while legacy go_* and process_* families remain for compatibility:

In the grouped family rows below, every concrete metric name within a family shares the same label set and cardinality profile.

| Metric family | Labels | Cardinality | Description |
| --- | --- | --- | --- |
| loki_vl_proxy_go_memstats_*, loki_vl_proxy_go_goroutines, loki_vl_proxy_go_gc_cycles_total, loki_vl_proxy_go_gc_duration_seconds | none | Low | Go runtime health |
| loki_vl_proxy_process_resident_memory_bytes, loki_vl_proxy_process_open_fds | none | Low | process resource usage |
| loki_vl_proxy_process_cpu_usage_ratio | mode | Low | CPU pressure split by user, system, iowait |
| loki_vl_proxy_process_memory_* | none | Low | total, free, available, usage ratio |
| loki_vl_proxy_process_disk_*_bytes_total | none | Low | disk I/O byte counters |
| loki_vl_proxy_process_disk_*_operations_total | none | Low | disk read/write operation counters |
| loki_vl_proxy_process_network_*_bytes_total | none | Low | network I/O counters |
| loki_vl_proxy_process_pressure_*_{some,full}_ratio | window | Low | Linux PSI gauges when available (10s, 60s, 300s) |

Legacy unprefixed compatibility aliases (go_*, process_*) follow the same label sets and cardinality profile as their loki_vl_proxy_* counterparts.
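
A hedged alerting sketch on these families; the concrete pressure metric name is expanded from the documented loki_vl_proxy_process_pressure_*_{some,full}_ratio pattern, and both thresholds are illustrative:

# sustained CPU pressure (PSI "some" share over the 60s window)
loki_vl_proxy_process_pressure_cpu_some_ratio{window="60s"} > 0.5

# resident memory creeping past an illustrative 2 GiB budget
loki_vl_proxy_process_resident_memory_bytes > 2 * 1024 * 1024 * 1024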

Kubernetes notes:

  • These runtime/system metrics are read from /proc and do not require Kubernetes RBAC permissions.
  • PSI metrics (process_pressure_*) depend on kernel support and may be absent on nodes without /proc/pressure/*.
  • On startup, the proxy logs a system-metrics readiness check with missing families and remediation hints instead of failing silently.
  • If you mount host /proc (-proc-root=/host/proc), these metrics will reflect host scope; keep default pod /proc for pod/container scope.
  • For per-pod attribution in OTLP backends, set OTEL_SERVICE_INSTANCE_ID from pod name and OTEL_SERVICE_NAMESPACE from pod namespace (the upstream chart now injects these by default).
  • CI includes a metric-name guard so new app metrics must stay under the loki_vl_proxy_* prefix unless explicitly allowlisted for compatibility.

PromQL Drilldowns For Slowness And Client Errors

Use these queries to quickly isolate downstream client pain, upstream slowness, and route-specific cache efficiency:

| Goal | Query |
| --- | --- |
| Downstream p95 latency by route | histogram_quantile(0.95, sum(rate(loki_vl_proxy_request_duration_seconds_bucket{system="loki",direction="downstream"}[5m])) by (le, endpoint, route)) |
| Upstream p95 latency by route | histogram_quantile(0.95, sum(rate(loki_vl_proxy_backend_duration_seconds_bucket{system="vl",direction="upstream"}[5m])) by (le, endpoint, route)) |
| Downstream 5xx rate by route | sum(rate(loki_vl_proxy_requests_total{system="loki",direction="downstream",status=~"5.."}[5m])) by (endpoint, route) |
| Tenant p99 latency by route | histogram_quantile(0.99, sum(rate(loki_vl_proxy_tenant_request_duration_seconds_bucket{system="loki",direction="downstream"}[5m])) by (le, tenant, endpoint, route)) |
| Route cache hit ratio | sum(rate(loki_vl_proxy_cache_hits_by_endpoint{system="loki",direction="downstream"}[5m])) by (endpoint, route) / clamp_min(sum(rate(loki_vl_proxy_cache_hits_by_endpoint{system="loki",direction="downstream"}[5m])) by (endpoint, route) + sum(rate(loki_vl_proxy_cache_misses_by_endpoint{system="loki",direction="downstream"}[5m])) by (endpoint, route), 1) |
| Client bad_request by route | sum(rate(loki_vl_proxy_client_errors_total{system="loki",direction="downstream",reason="bad_request"}[5m])) by (endpoint, route) |

For latency histograms, keep dashboards on p50, p95, and p99 rather than averages. Averages hide tail latency incidents. For exact proxy-only overhead, use structured logs (proxy.overhead_ms) alongside the latency histograms; subtracting histogram quantiles is not mathematically reliable.

The packaged Loki-VL-Proxy dashboard includes an Operational Resources section with:

  • memory saturation and memory footprint/headroom
  • CPU usage split by mode
  • disk IOPS up/down and disk throughput up/down
  • network up/down
  • PSI pressure (cpu/memory/io)
  • process RSS and open file descriptors by pod

The top of the dashboard is organized as a left-to-right operator flow:

  • Main Overview - Client -> Proxy -> VictoriaLogs
  • Client Edge - Request Quality & Shape
  • Heavy Consumers - Client Load Drivers
  • Proxy -> VictoriaLogs Query Pipeline

It also includes a Query-Range Windowing section for cache/tuning signals:

  • window fetch p50/p95 latency
  • window merge p50/p95 latency
  • window cache hit ratio
  • adaptive window parallelism + EWMA latency/error

It also includes a Long-Range Resilience KPIs section for phase tuning:

  • prefilter kept/skipped rate
  • retry/degraded-batch/partial-response rate
  • prefilter hit ratio

Dashboard datasource notes:

  • datasource variable regex is intentionally permissive (/.*/) so the dashboard works with scrape-backed and OTLP-backed metric datasources without renaming
  • key stat panels use explicit zero fallbacks so dashboards remain readable during cold starts and low-traffic windows

Active Backend E2E Healthchecks

/ready confirms backend reachability, but production health should also include synthetic end-to-end probes with real query traffic shape.

Recommended pattern:

  1. Probe /ready every 15-30s for hard availability.
  2. Run a lightweight synthetic query_range every 1-5m from inside the cluster.
  3. Alert when synthetic query latency or error ratio breaches SLO even if /ready is green.

This catches backend partial degradation (slow scans, storage pressure, auth drift) earlier than readiness alone.
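
A hedged shell sketch of step 2; the host, tenant header, and query are illustrative placeholders (X-Scope-OrgID follows the usual Loki tenant-header convention), so substitute a cheap, real-shaped query from your own workload:

#!/bin/sh
# Synthetic query_range probe: a lightweight real-shaped read every few minutes.
now=$(date +%s)
start=$((now - 300))
# query={service_name="api"} URL-encoded
curl -fsS --max-time 10 \
  -H 'X-Scope-OrgID: synthetic' \
  "http://loki-vl-proxy:3100/loki/api/v1/query_range?query=%7Bservice_name%3D%22api%22%7D&start=${start}&end=${now}&limit=10" \
  > /dev/null || echo "loki-vl-proxy synthetic probe failed" >&2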

Choosing Client Identity

Per-client metrics and request logs can use trusted upstream identity instead of only remote IP:

-metrics.trust-proxy-headers=true

When enabled, the proxy prefers:

  1. Trusted user headers (X-Grafana-User, X-Forwarded-User, X-Webauth-User, X-Auth-Request-User)
  2. tenant
  3. trusted forwarded client IP (X-Forwarded-For)
  4. remote IP

Datasource/basic-auth credentials are reported separately under auth.* and are not used as end-user identity. Only enable trusted proxy headers when the proxy sits behind a trusted auth proxy or Grafana instance.
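
For a quick attribution check, a hedged curl sketch that sends a trusted user header (the X-Scope-OrgID tenant header is an assumption of standard Loki conventions):

# expect enduser.id=alice@example.com and enduser.source=grafana_user in the request log
curl -fsS \
  -H 'X-Grafana-User: alice@example.com' \
  -H 'X-Scope-OrgID: team-a' \
  'http://loki-vl-proxy:3100/loki/api/v1/labels'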

Integration Examples

OpenTelemetry Collector: scrape /metrics and export OTLP

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: loki-vl-proxy
          scrape_interval: 15s
          static_configs:
            - targets: ["loki-vl-proxy:3100"]

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://otel-gateway.example.com
    headers:
      Authorization: Bearer ${OTLP_TOKEN}

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp]

OpenTelemetry Collector: collect JSON logs from container stdout

receivers:
  filelog:
    include:
      - /var/log/containers/*loki-vl-proxy*.log
    operators:
      - type: json_parser

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://otel-gateway.example.com

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch]
      exporters: [otlphttp]

Vector: ship structured JSON logs

[sources.proxy_logs]
type = "kubernetes_logs"

[transforms.proxy_json]
type = "remap"
inputs = ["proxy_logs"]
source = '''
. = parse_json!(string!(.message))
'''

[sinks.proxy_otlp]
type = "opentelemetry"
inputs = ["proxy_json"]
protocol.type = "http"
protocol.uri = "https://otel-gateway.example.com/v1/logs"

Fluent Bit: tail container logs and keep JSON structure

[INPUT]
    Name              tail
    Path              /var/log/containers/*loki-vl-proxy*.log
    Parser            docker
    Tag               loki_vl_proxy

[FILTER]
    Name              parser
    Match             loki_vl_proxy
    Key_Name          log
    Parser            json

[OUTPUT]
    Name              opentelemetry
    Match             loki_vl_proxy
    Host              otel-collector
    Port              4318
    Logs_uri          /v1/logs

Recommended Dashboards and Alerts

Start with:

  • request rate and error rate by endpoint
  • backend latency p95/p99 by endpoint
  • cache hit ratio overall and by endpoint
  • top client by request rate, bytes, and query length
  • top tenant by request volume and latency
  • circuit breaker state
  • process RSS and open file descriptors

Dashboard Catalog

| Dashboard | Source | Primary use |
| --- | --- | --- |
| dashboard/loki-vl-proxy.json | Prometheus metrics | Service health, SLOs, cache and endpoint latency trends |

Metrics Dashboard Setup (Scrape and OTLP Push)

The metrics dashboard includes a Datasource variable and works with either metric transport mode:

  • Prometheus scrape (/metrics + ServiceMonitor)
  • OTLP push (-otlp-endpoint=...) into a Prometheus-compatible backend

Recommended setup:

  1. Point Datasource to any Prometheus-compatible datasource that contains loki_vl_proxy_* metrics.
  2. For scrape mode, use the datasource fed by your ServiceMonitor/Prometheus scrape pipeline.
  3. For OTLP push mode, use the datasource fed by your OTLP metrics pipeline.
  4. VictoriaMetrics can be used for both modes when it receives both scrape and OTLP streams.

Transport checklist:

  • Scrape mode:
    • -server.register-instrumentation=true
    • Helm serviceMonitor.enabled=true
  • OTLP push mode:
    • -otlp-endpoint configured
    • -server.register-instrumentation=false (optional, recommended when you want push-only)

Quick validation in Grafana Explore against the selected datasource:

loki_vl_proxy_uptime_seconds

If this query has data, the Loki-VL-Proxy Metrics dashboard should populate out of the box.

High-signal alert ideas:

  • 5xx rate rising on query endpoints
  • cache hit ratio collapsing
  • backend latency p95 breaching SLO
  • a single client dominating bytes or query length
  • circuit breaker opening repeatedly
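
As a starting point, a hedged Prometheus alerting-rule sketch covering two of these ideas; metric and label names are verbatim from this guide, while thresholds and durations are illustrative rather than packaged defaults:

groups:
  - name: loki-vl-proxy-starter
    rules:
      - alert: LokiVLProxyHigh5xxRatio
        expr: |
          sum(rate(loki_vl_proxy_requests_total{direction="downstream",status=~"5.."}[5m]))
            / clamp_min(sum(rate(loki_vl_proxy_requests_total{direction="downstream"}[5m])), 1)
            > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: loki-vl-proxy downstream 5xx ratio above 5% for 10 minutes
      - alert: LokiVLProxyCircuitBreakerOpen
        expr: max(loki_vl_proxy_circuit_breaker_state) >= 1   # 1=open, 2=half-open
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: loki-vl-proxy circuit breaker is open or half-open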

The packaged alert set and incident procedures live in:

Notes

  • OTLP push and Prometheus scrape share the same important proxy metrics and metric names.
  • The OTLP export is intentionally lightweight and does not pull in the full OpenTelemetry Go SDK.
  • Structured logs are already safe for JSON ingestion; agents can forward them directly or transform them into OTLP logs.