feat: add structured logs to zenml-server and OTel instrumentation by amitvikramraj · Pull Request #4676 · zenml-io/zenml

amitvikramraj · 2026-04-02T11:55:46Z

Changes I made

Replaced the custom ConsoleFormatter with structlog as a ProcessorFormatter over Python's standard logging.
Server-side modules (zen_server/, zen_stores/) now emit structured events with keyword arguments instead of f-string messages.
Request context (request_id, method, path, etc.) is automatically propagated via structlog contextvars — no manual interpolation needed.
Added OpenTelemetry exporter for logs, traces and metrics.
Also includes a few small improvements:
- X-Request-ID header in error responses, so the matching server logs are easy to filter
- Added rate limit warning messages in logs

Why

Server logs were unstructured f-strings with manually threaded request IDs across 11+ files.
This made them impossible to query with tools like Loki/Grafana and painful to maintain.
structlog provides parseable JSON output in the server and colored console output locally, with automatic context propagation.

Before / After

Before:

[d5638eeb] API STATS - GET /api/v1/pipelines AUTHORIZING  [ active_requests: 1 memory_used_mb: 343.77 thread_count: 6 open_fds: 28 fd_limit: 65535 ]
[d5638eeb] SQL STATS - GET /api/v1/pipelines 'SqlZenStore.list_runs' STARTED  [ active_connections: 0 idle_connections: 3 overflow_connections: -17 ]
[d5638eeb] SQL STATS - GET /api/v1/pipelines 'SqlZenStore.list_runs' COMPLETED in 12.34ms  [ active_connections: 0 idle_connections: 3 overflow_connections: -17 ]
[d5638eeb] API STATS - 200 GET /api/v1/pipelines took 38.42ms  [ active_requests: 1 memory_used_mb: 343.77 thread_count: 6 ]

After — console:

2026-04-02T09:21:52Z [debug    ] request.received               [zenml.zen_server.middleware] client_ip=172.217.26.123 method=GET path=/api/v1/pipelines request_id=d5638eeb
2026-04-02T09:21:52Z [debug    ] sql.session.started            [zenml.zen_stores.sql_zen_store] active_connections=0 caller=SqlZenStore.list_runs idle_connections=3 request_id=d5638eeb
2026-04-02T09:21:52Z [debug    ] sql.session.completed          [zenml.zen_stores.sql_zen_store] caller=SqlZenStore.list_runs duration_ms=12.34 error=false request_id=d5638eeb
2026-04-02T09:21:52Z [debug    ] request.completed              [zenml.zen_server.middleware] duration_ms=38.42ms method=GET path=/api/v1/pipelines request_id=d5638eeb status_code=200

After — JSON (server default, queryable in Loki/Grafana):

{"event":"request.received","level":"debug","logger":"zenml.zen_server.middleware","method":"GET","path":"/api/v1/pipelines","request_id":"d5638eeb","client_ip":"172.217.26.123","timestamp":"2026-04-02T09:21:52Z"}
{"event":"sql.session.started","level":"debug","logger":"zenml.zen_stores.sql_zen_store","caller":"SqlZenStore.list_runs","active_connections":0,"idle_connections":3,"request_id":"d5638eeb","timestamp":"2026-04-02T09:21:52Z"}
{"event":"sql.session.completed","level":"debug","logger":"zenml.zen_stores.sql_zen_store","caller":"SqlZenStore.list_runs","duration_ms":12.34,"error":false,"request_id":"d5638eeb","timestamp":"2026-04-02T09:21:52Z"}
{"event":"request.completed","level":"debug","logger":"zenml.zen_server.middleware","method":"GET","path":"/api/v1/pipelines","status_code":200,"duration_ms":"38.42ms","request_id":"d5638eeb","timestamp":"2026-04-02T09:21:52Z"}
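To illustrate why the JSON form is easier to query, here is a small sketch that filters two of the lines above by field. The helpers `select` and `duration_ms` are hypothetical names introduced for this example. Note that in the sample output `duration_ms` appears both as a number and as a string with an `ms` suffix, so the sketch accepts both.

```python
import json

# Two lines copied verbatim from the JSON output above.
LOG_LINES = [
    '{"event":"sql.session.completed","level":"debug","logger":"zenml.zen_stores.sql_zen_store","caller":"SqlZenStore.list_runs","duration_ms":12.34,"error":false,"request_id":"d5638eeb","timestamp":"2026-04-02T09:21:52Z"}',
    '{"event":"request.completed","level":"debug","logger":"zenml.zen_server.middleware","method":"GET","path":"/api/v1/pipelines","status_code":200,"duration_ms":"38.42ms","request_id":"d5638eeb","timestamp":"2026-04-02T09:21:52Z"}',
]


def select(lines, **fields):
    """Return the parsed records whose fields all match the given values."""
    records = (json.loads(line) for line in lines)
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in fields.items())
    ]


def duration_ms(record):
    """Read duration_ms whether it is a number or a string like '38.42ms'."""
    raw = record["duration_ms"]
    if isinstance(raw, str) and raw.endswith("ms"):
        raw = raw[:-2]
    return float(raw)


same_request = select(LOG_LINES, request_id="d5638eeb")
completed = select(LOG_LINES, event="request.completed")
```

With the old bracketed text format, the same lookup would need a regular expression per message shape.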

Steps to reproduce:

Run docker compose up --build — ZenML UI starts on port 3001, Grafana UI runs on port 3002
Open Grafana → Loki → Verify the logs

TODO: Later remove the docker-compose.

socket-security · 2026-04-02T11:57:59Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	opentelemetry-instrumentation-requests@0.62b0
	opentelemetry-exporter-otlp-proto-http@1.41.0
	opentelemetry-instrumentation-logging@0.62b0
	opentelemetry-instrumentation-fastapi@0.62b0
	opentelemetry-distro@0.62b0
	opentelemetry-instrumentation-sqlalchemy@0.62b0

View full report

Json-Andriopoulos · 2026-04-08T09:19:51Z

 )

-logger = get_logger(__name__)
+logger = structlog.get_logger(__name__)


We shouldn't do that. We should configure the logging module at root level, then any logger we instantiate with the standard logging.getLogger(__name__) instruction has the formatter changes in place.

Something like this:

    import logging

    import structlog

    shared = [
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
    ]

    formatter = structlog.stdlib.ProcessorFormatter(
        foreign_pre_chain=shared,
        processors=[
            structlog.stdlib.ProcessorFormatter.remove_processors_meta,
            structlog.processors.JSONRenderer(),
        ],
    )

    handler = logging.StreamHandler()
    handler.setFormatter(formatter)

    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(logging.INFO)

    logger = logging.getLogger(__name__)
    logger.info("hello from stdlib logging")

That avoids propagating logging changes across our codebase but still does what we want (structured JSON logs).

But if we call the std logger, then we would not be able to pass extra context.

    logger = logging.getLogger(__name__)
    logger.info("hello from stdlib logging", ctx="something", another_ctx="something_else")
    # ^^^ this would not work

You can't pass them this way, sure, but you can use other structlog utilities such as structlog.contextvars.bind_contextvars to do so. I imagine we wouldn't pass context vars this way either. A more sensible way to do it is:

    with context_vars_set():  # -> Some context manager that sets context vars on the logger
        do_something()
        logger.info()
        try:
            do()
        except:
            logger.error()
        finally:
            logger.debug()

With this pattern all log calls pick up the context here, no reason to pass them explicitly.

Here is how I would approach migration to structlog. This is up for discussion btw, don't want to enforce it as a pattern:

I would keep on using the same log statements, keep on using python's native logger.

I would use structlog to re-configure loggers (JSON formatter etc.)

I would introduce a few utilities (context manager, FastAPI middleware, etc.) that inject context vars in a few controlled places, so that all subsequent log records carry them (e.g. a trace ID at the beginning of any API call).

What that gives us: JSON logs and the context needed for debugging (e.g. tracking all log statements for a particular request), with limited changes. I am a bit sceptical about introducing changes across the codebase just to comply with structlog's logger.
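One of the "controlled places" mentioned above could be a small ASGI middleware. The sketch below uses only the standard library and is not the PR's implementation: `RequestIdMiddleware`, `request_id_var`, `failing_app` and `call_once` are hypothetical names. It sets a request ID for the duration of one request and echoes it back in an `X-Request-ID` response header, including on error responses.

```python
import asyncio
import contextvars
import uuid

# Holds the current request's ID; a logging filter or structlog processor
# could read it when formatting records.
request_id_var = contextvars.ContextVar("request_id", default=None)

seen_ids = []


class RequestIdMiddleware:
    """Hypothetical pure-ASGI middleware that scopes a request ID."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        headers = dict(scope.get("headers", []))
        incoming = headers.get(b"x-request-id")
        request_id = incoming.decode() if incoming else uuid.uuid4().hex[:8]
        token = request_id_var.set(request_id)

        async def send_with_request_id(message):
            # Attach the ID to the response head, whatever the status code.
            if message["type"] == "http.response.start":
                response_headers = list(message.get("headers", []))
                response_headers.append((b"x-request-id", request_id.encode()))
                message = {**message, "headers": response_headers}
            await send(message)

        try:
            await self.app(scope, receive, send_with_request_id)
        finally:
            request_id_var.reset(token)


async def failing_app(scope, receive, send):
    """Stand-in endpoint that answers with a 500 error."""
    seen_ids.append(request_id_var.get())
    await send({"type": "http.response.start", "status": 500, "headers": []})
    await send({"type": "http.response.body", "body": b"internal error"})


async def call_once(headers):
    """Drive the middleware once and collect the messages it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {
        "type": "http",
        "method": "GET",
        "path": "/api/v1/pipelines",
        "headers": headers,
    }
    await RequestIdMiddleware(failing_app)(scope, receive, send)
    return sent


messages = asyncio.run(call_once([(b"x-request-id", b"d5638eeb")]))
```

The wrapped app sees the ID through `request_id_var`, and the 500 response carries the same value in its headers.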

The difference boils down here:

logger.info("hello from stdlib logging", ctx="something", another_ctx="something_else")

I disagree that we should do it. We would need to change hundreds of log statements and the benefit is minimal. If you have context specific to one log statement, then just add it in the message as you would today. If the context is shared instead, set it with a context manager, and all log statements within that block will carry the context variables:

    with bind_context(ctx_var="test"):
        logger.info("Hello")
        do()
        logger.warning("Oh oh")

github-actions · 2026-04-09T07:57:43Z

Documentation Link Check Results

❌ Absolute links check failed
There are broken absolute links in the documentation. See workflow logs for details
❌ Relative links check failed
There are broken relative links in the documentation. See workflow logs for details
Last checked: 2026-04-23 09:08:41 UTC

stefannica

I really love where this PR is going. You have successfully replaced the entire custom API request tracing mess with centralized structured logging. Some things to consider:

- Test that the request trace logs are still generated and contain all the necessary info. The context vars you used need to travel from asyncio coroutines down to worker threads, and the connection you made through structlog might fail to cover everything.
- Is there a way to hide structlog from the code that actually emits logs? I want it to be an implementation detail rather than something that regular Python modules need to be aware of and import directly. The key lies in reusing the existing centralized logging module zenml.logging.
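The first concern can be checked with a small standard-library experiment. The sketch below is illustrative only (`blocking_db_call` and `handle_request` are made-up names): it compares three ways of handing work from a coroutine to a worker thread and records which request ID the thread sees.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

request_id_var = contextvars.ContextVar("request_id", default=None)


def blocking_db_call():
    """Stand-in for work done in a worker thread; reports the ID it sees."""
    return request_id_var.get()


async def handle_request(request_id):
    request_id_var.set(request_id)
    loop = asyncio.get_running_loop()

    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain executor call: the worker thread does not receive the
        # coroutine's context, so it typically sees the default value.
        bare = await loop.run_in_executor(pool, blocking_db_call)

        # Explicitly run the callable inside a copy of the current context.
        ctx = contextvars.copy_context()
        copied = await loop.run_in_executor(pool, ctx.run, blocking_db_call)

    # asyncio.to_thread propagates the current context automatically.
    via_to_thread = await asyncio.to_thread(blocking_db_call)

    return {"bare": bare, "copied": copied, "to_thread": via_to_thread}


results = asyncio.run(handle_request("d5638eeb"))
```

A test along these lines, run against the server's actual thread hand-off, would show whether `request_id` survives into the SQL-layer log events.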

stefannica · 2026-04-13T09:05:29Z

    CredentialsNotValid,
    OAuthError,
 )
-from zenml.logger import get_logger


Currently, all ZenML code uses the zenml.logger module and utilities. You can use that to your advantage: instead of replacing this with structlog in various modules, you should go to the source and update the zenml.logger functionality to use structlog instead when configured, which means that you automatically get structured logs everywhere instead of just a handful of modules.

Keep in mind that the same logger module is also used by client code, where you still want to keep unstructured logs as the default.
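The idea of switching formats at the source can be sketched as below. This is a simplified stand-in, not ZenML's logger module: `init_logging`, this `get_logger`, the `demo_app` logger namespace and `_JsonFormatter` are all hypothetical, and a stdlib JSON formatter is used in place of structlog so the example has no dependencies. Callers only ever use `get_logger`, while one setup function decides between structured and plain output.

```python
import io
import json
import logging


class _JsonFormatter(logging.Formatter):
    """Stdlib stand-in for a structured renderer such as a JSON renderer."""

    def format(self, record):
        return json.dumps(
            {
                "event": record.getMessage(),
                "level": record.levelname.lower(),
                "logger": record.name,
            }
        )


def init_logging(structured, stream=None):
    """Hypothetical central setup: the only place that knows the format."""
    handler = logging.StreamHandler(stream)
    if structured:
        handler.setFormatter(_JsonFormatter())
    else:
        handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    base = logging.getLogger("demo_app")
    base.handlers = [handler]
    base.setLevel(logging.INFO)
    base.propagate = False


def get_logger(name):
    """Hypothetical helper that modules call instead of importing structlog."""
    return logging.getLogger(f"demo_app.{name}")


# Server process: structured output.
server_out = io.StringIO()
init_logging(structured=True, stream=server_out)
get_logger("zen_server").info("request.completed")

# Client process: plain text stays the default.
client_out = io.StringIO()
init_logging(structured=False, stream=client_out)
get_logger("client").info("Pipeline run finished")
```

The two call sites are identical in shape; only the setup call differs between server and client.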

Thanks for this, it's a good idea. Earlier, I purposely didn't touch it for the same reason, since it is being used on the client side as well.

addressed this

stefannica · 2026-04-13T09:14:04Z

+        )
+        return
+
+    service_name = os.environ.get("OTEL_SERVICE_NAME", "zenml-server")


It's not a good idea to have all these environment variables hidden inside the code here. It's best to add them to the ServerConfiguration in zenml.config.server.config, where they are centralized, automatically set from the environment, and discoverable by users.
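As a rough illustration of the centralization being asked for, the sketch below gathers OTel-related settings into one typed object populated from the environment. It uses a stdlib dataclass as a stand-in for the real configuration class, and the `DEMO_SERVER_` prefix, the `OTelSettings` name and all field names are assumptions made for this example. Only the `zenml-server` default comes from the diff above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

ENV_PREFIX = "DEMO_SERVER_"  # hypothetical prefix chosen for this sketch


@dataclass(frozen=True)
class OTelSettings:
    """Hypothetical settings block; field names are illustrative only."""

    otel_enabled: bool = False
    otel_service_name: str = "zenml-server"
    otel_exporter_endpoint: Optional[str] = None

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "OTelSettings":
        """Build the settings from prefixed environment variables."""
        env = os.environ if env is None else env

        def read(name: str) -> Optional[str]:
            return env.get(ENV_PREFIX + name.upper())

        return cls(
            otel_enabled=(read("otel_enabled") or "").lower()
            in {"1", "true", "yes"},
            otel_service_name=read("otel_service_name") or "zenml-server",
            otel_exporter_endpoint=read("otel_exporter_endpoint"),
        )


defaults = OTelSettings.from_env({})
staging = OTelSettings.from_env(
    {
        "DEMO_SERVER_OTEL_ENABLED": "true",
        "DEMO_SERVER_OTEL_SERVICE_NAME": "staging",
    }
)
```

The setup code then receives a settings object instead of calling `os.environ.get` itself, which keeps every supported variable visible in one place.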

addressed it!

stefannica · 2026-04-13T09:19:13Z

    "tldextract~=5.1.0",
    "itsdangerous~=2.2.0",
    "croniter>=6.0.0",
+    "opentelemetry-distro>=0.59b0",


we already have OTEL dependencies in our client. Please check that they match version-wise.

Could I get a link to where that dependency is? Is it the sdk dependency here at line 41 in toml file "opentelemetry-sdk==1.38.0" ?
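One way to check the versions in a given environment is to list what is actually installed. The sketch below uses only `importlib.metadata`, and `installed_versions` is a hypothetical helper name. As the Socket report above shows (1.41.0 next to 0.62b0), the OpenTelemetry packages in this PR follow two different version schemes, so a plain string comparison between them is not enough.

```python
from importlib import metadata
from typing import Dict


def installed_versions(prefix: str = "opentelemetry") -> Dict[str, str]:
    """Map installed distribution names starting with `prefix` to versions."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith(prefix):
            found[name.lower()] = dist.version
    return dict(sorted(found.items()))


versions = installed_versions()
for name, version in versions.items():
    print(f"{name}=={version}")
```

Running this in both the client and the server environment gives two lists that can be compared side by side.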

github-actions · 2026-04-19T05:49:30Z

🔍 Broken Links Report

Summary

📁 Files with broken links: 2
🔗 Total broken links: 4
📄 Broken markdown links: 4
🖼️ Broken image links: 0
⚠️ Broken reference placeholders: 0

Details

File	Link Type	Link Text	Broken Path
`zenml-pro/resource-pools-core-concepts.md`	📄	"How preemption works"	`resource-pools-preemption.md`
`zenml-pro/resource-pools-examples.md`	📄	"How preemption works"	`resource-pools-preemption.md`
`zenml-pro/resource-pools-examples.md`	📄	"How preemption works"	`resource-pools-preemption.md`
`zenml-pro/resource-pools-examples.md`	📄	"How preemption works"	`resource-pools-preemption.md`

📂 Full file paths

/home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/resource-pools-core-concepts.md
/home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/resource-pools-examples.md
/home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/resource-pools-examples.md
/home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/resource-pools-examples.md

amitvikramraj linked an issue Apr 2, 2026 that may be closed by this pull request

Add structured log #4675

Open

1 task

Json-Andriopoulos reviewed Apr 2, 2026

Comment thread docker/zenml-server-dev.Dockerfile Outdated

amitvikramraj marked this pull request as ready for review April 7, 2026 09:37

Json-Andriopoulos reviewed Apr 8, 2026

stefannica requested changes Apr 13, 2026

amitvikramraj force-pushed the feat/otel branch 2 times, most recently from 087c7a7 to dced3e4 Compare April 19, 2026 05:49

amitvikramraj changed the title ~~Add structured logs to zenml-server using structlog~~ feat: add structured logs to zenml-server and OTel instrumentation Apr 19, 2026

amitvikramraj added 19 commits April 23, 2026 07:31

feat: docker-compose setup for OTel auto-instrumentation

6ecf746

chore: added uv.lock in gitignore

ae92d4a

feat: added structlog as dependency

97dad48

refactor(logging): replace ConsoleFormatter with structlog ProcessorFormatter

1119430

feat: propagate request context via structlog contextvars in middleware

3d475fa

refactor: converted server modules to structured event logging

fc46d4e

refactor: removed get_system_metrics_log_str from utils

62156ab

feat: include X-Request-ID header in error responses to easily correlate server logs

41bbbb1

fix: added sanitize filter to prevent LogRecord serialization warnings in OTel

b06889a

feat: added DB ping to /ready endpoint

49210d4

linting changes

6ae5e16

linting changes

01a0c59

chore: moved otel deps to server dependency group

61979f5

chore: undoing db check in ready endpoint

a5a58dc

renamed variable in _error_response handling function

39452a8

chore: undoing formatting changes

6cc913a

feat: added OpenTelemetry config to zenml_server config, `ServerConfiguration`

4e44341

feat: replaced stdlib logger with structlog logger in get_logger function and reverted back the usage of get_logger function in zen_server modules. And added comments for structlog settings

790e0a9

feat: configured uvicorn loggers to use structlog and propagate uvicorn logs via OTel

9e48f57

feat: better colored logs, grey out debug logs, fixed how client side logs appeared and added a function to sanitize log records for OTel compatibility in OTel Log Store

eed1b16

amitvikramraj force-pushed the feat/otel branch from dced3e4 to eed1b16 Compare April 23, 2026 02:02

amitvikramraj added internal To filter out internal PRs and issues backend Issues that require changes on the backend core-team Issues that are being handled by the core team labels Apr 23, 2026

amitvikramraj added 2 commits April 23, 2026 12:38

refactor: moved color definitions and patterns into _ConsoleRenderer class

0fe5968

updated docstrings

4c3fc2d
