Skip to content

feat: add structured logs to zenml-server and OTel instrumentation#4676

Open
amitvikramraj wants to merge 22 commits intodevelopfrom
feat/otel
Open

feat: add structured logs to zenml-server and OTel instrumentation#4676
amitvikramraj wants to merge 22 commits intodevelopfrom
feat/otel

Conversation

@amitvikramraj
Copy link
Copy Markdown
Contributor

@amitvikramraj amitvikramraj commented Apr 2, 2026

Changes I made

  • Replaced the custom ConsoleFormatter with structlog as a ProcessorFormatter over Python's standard logging.
  • Server-side modules (zen_server/, zen_stores/) now emit structured events with keyword arguments instead of f-string messages.
  • Request context (request_id, method, path, etc.) is automatically propagated via structlog contextvars — no manual interpolation needed.
  • Added OpenTelemetry exporter for logs, traces and metrics.
  • Also includes a few small improvements:
    • X-Request-ID in error responses to easily be able to filter logs
    • Added rate limit warning messages in logs

Why

  • Server logs were unstructured f-strings with manually threaded request IDs across 11+ files.
  • This made them impossible to query in by tools like Loki/Garfana and painful to maintain.
  • structlog provides parseable JSON output in the server and colored console output locally, with automatic context propagation.

Before / After

Before:

[d5638eeb] API STATS - GET /api/v1/pipelines AUTHORIZING  [ active_requests: 1 memory_used_mb: 343.77 thread_count: 6 open_fds: 28 fd_limit: 65535 ]
[d5638eeb] SQL STATS - GET /api/v1/pipelines 'SqlZenStore.list_runs' STARTED  [ active_connections: 0 idle_connections: 3 overflow_connections: -17 ]
[d5638eeb] SQL STATS - GET /api/v1/pipelines 'SqlZenStore.list_runs' COMPLETED in 12.34ms  [ active_connections: 0 idle_connections: 3 overflow_connections: -17 ]
[d5638eeb] API STATS - 200 GET /api/v1/pipelines took 38.42ms  [ active_requests: 1 memory_used_mb: 343.77 thread_count: 6 ]

After — console:

2026-04-02T09:21:52Z [debug    ] request.received               [zenml.zen_server.middleware] client_ip=172.217.26.123 method=GET path=/api/v1/pipelines request_id=d5638eeb
2026-04-02T09:21:52Z [debug    ] sql.session.started            [zenml.zen_stores.sql_zen_store] active_connections=0 caller=SqlZenStore.list_runs idle_connections=3 request_id=d5638eeb
2026-04-02T09:21:52Z [debug    ] sql.session.completed          [zenml.zen_stores.sql_zen_store] caller=SqlZenStore.list_runs duration_ms=12.34 error=false request_id=d5638eeb
2026-04-02T09:21:52Z [debug    ] request.completed              [zenml.zen_server.middleware] duration_ms=38.42ms method=GET path=/api/v1/pipelines request_id=d5638eeb status_code=200

After — JSON (server default, queryable in Loki/Grafana):

{"event":"request.received","level":"debug","logger":"zenml.zen_server.middleware","method":"GET","path":"/api/v1/pipelines","request_id":"d5638eeb","client_ip":"172.217.26.123","timestamp":"2026-04-02T09:21:52Z"}
{"event":"sql.session.started","level":"debug","logger":"zenml.zen_stores.sql_zen_store","caller":"SqlZenStore.list_runs","active_connections":0,"idle_connections":3,"request_id":"d5638eeb","timestamp":"2026-04-02T09:21:52Z"}
{"event":"sql.session.completed","level":"debug","logger":"zenml.zen_stores.sql_zen_store","caller":"SqlZenStore.list_runs","duration_ms":12.34,"error":false,"request_id":"d5638eeb","timestamp":"2026-04-02T09:21:52Z"}
{"event":"request.completed","level":"debug","logger":"zenml.zen_server.middleware","method":"GET","path":"/api/v1/pipelines","status_code":200,"duration_ms":"38.42ms","request_id":"d5638eeb","timestamp":"2026-04-02T09:21:52Z"}

Steps to reproduce:

  1. Run docker compose up --build — ZenML UI starts on port 3001, Grafana UI runs on port 3002
  2. Open Grafana → Loki → Verify the logs

TODO: Later remove the docker-compose.

@socket-security
Copy link
Copy Markdown

socket-security Bot commented Apr 2, 2026

@amitvikramraj amitvikramraj linked an issue Apr 2, 2026 that may be closed by this pull request
1 task
Comment thread docker/zenml-server-dev.Dockerfile Outdated
@amitvikramraj amitvikramraj marked this pull request as ready for review April 7, 2026 09:37
Comment thread src/zenml/zen_server/auth.py Outdated
)

logger = get_logger(__name__)
logger = structlog.get_logger(__name__)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't do that. We should configure the logging module at root level, then any logger we instantiate with the standard logging.getLogger(__name__) instruction has the formatter changes in place.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like that:

import logging
import structlog

shared = [
    structlog.stdlib.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
]

formatter = structlog.stdlib.ProcessorFormatter(
    foreign_pre_chain=shared,
    processors=[
        structlog.stdlib.ProcessorFormatter.remove_processors_meta,
        structlog.processors.JSONRenderer(),
    ],
)

handler = logging.StreamHandler()
handler.setFormatter(formatter)

root = logging.getLogger()
root.handlers = [handler]
root.setLevel(logging.INFO)

logger = logging.getLogger(__name__)
logger.info("hello from stdlib logging")

That doesn't propagate changes regarding logging across our codebase but does the work we want (structured, JSON logs).

Copy link
Copy Markdown
Contributor Author

@amitvikramraj amitvikramraj Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we call the std logger, then we would not be able to pass extra context.

logger = logging.getLogger(__name__)
logger.info("hello from stdlib logging", ctx="something", another_ctx="something_else")
# ^^^this would not work

Copy link
Copy Markdown
Contributor

@Json-Andriopoulos Json-Andriopoulos Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't pass them this way sure, but you can use other structlog utility like bind_context_vars to do so. I imagine we wouldn't pass context vars this way either. A more sensible way to do it is:

with context_vars_set():  # -> Some context manager that sets context vars on the logger
    do_something()
    logger.info()
    try:
        do()
    except:
        logger.error()
    finally:
        logger.debug()

With this pattern all log calls pick up the context here, no reason to pass them explicitly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is how I would approach migration to structlog. This is up for discussion btw, don't want to enforce it as a pattern:

  1. I would keep on using the same log statements, keep on using python's native logger.
  2. I would use structlog to re-configure loggers (JSO formatter etc.)
  3. I would introduce a few utilities (context manager, fastapi middleware, etc.) that help me introduce context vars on a few, controlled places that inject context for all subsequent log records (e.g trace-id on the beginning of any API call).

What that gives us: We get JSONified logs, we get context that is necessary for debugging (e.g. track all log statements for a particular request) and we do it with limited changes. I am a bit sceptic if I want to introduce changes across the codebase to comply with structlog's logger.

The difference boils down here:

logger.info("hello from stdlib logging", ctx="something", another_ctx="something_else")

I disagree that we should do it. We would need to change hundreds of log statements and the benefit is minimal. If you have context specific to one log statement then just add it in the message as you would today. If not, if the context is shared do it with context manager and all log statements within that block will have the context variables:

with bind_context(ctx_var="test"):
    logger.info("Hello")
    do()
    logger.warning("Oh oh")

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 9, 2026

Documentation Link Check Results

Absolute links check failed
There are broken absolute links in the documentation. See workflow logs for details
Relative links check failed
There are broken relative links in the documentation. See workflow logs for details
Last checked: 2026-04-23 09:08:41 UTC

Copy link
Copy Markdown
Contributor

@stefannica stefannica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really love where this PR is going. You have successfully replaced the entire custom API request tracing mess with centralized structured logging. Some things to consider:

  • test that the request trace logs are still generated and have all the necessary info. The context vars you used need to travel from async-io coroutines down to worker threads and the connection you made through structlog might fail to cover everything.
  • is there a way to hide structlogs from the actual code that emits logs ? I want it to be an implementation detail rather than something that the regular python modules need to be aware of and import directly. The key lies in reusing the existing centralized logging module zenml.logging.

Comment thread docker/zenml-server-dev.Dockerfile Outdated
Comment thread docker/zenml-server-dev.Dockerfile Outdated
Comment thread docker/zenml-server-dev.Dockerfile Outdated
Comment thread docker/zenml-server-dev.Dockerfile Outdated
CredentialsNotValid,
OAuthError,
)
from zenml.logger import get_logger
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, all ZenML code uses the zenml.logger module and utilities. You can use that to your advantage: instead of replacing this with structlog in various modules, you should go to the source and update the zenml.logger functionality to use structlog instead when configured, which means that you automatically get structured logs everywhere instead of just a handful of modules.

Keep in mind that the same logger module is also used by client code, where you still want to keep unstructured logs as the default.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, it's a good idea. Earlier, I purposely didn't touch it for the same reason, since it is being used on the client side as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed this

Comment thread src/zenml/zen_server/otel.py Outdated
)
return

service_name = os.environ.get("OTEL_SERVICE_NAME", "zenml-server")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not a good idea to have all these environment variables hidden inside the code here. it's best to add them to the ServerConfiguration in zenml.config.server.config where they are centralized, automatically set from environment and discover-able by users.

Copy link
Copy Markdown
Contributor Author

@amitvikramraj amitvikramraj Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed it!

Comment thread pyproject.toml
"tldextract~=5.1.0",
"itsdangerous~=2.2.0",
"croniter>=6.0.0",
"opentelemetry-distro>=0.59b0",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have OTEL dependencies in our client. Please check that they match version-wise.

Copy link
Copy Markdown
Contributor Author

@amitvikramraj amitvikramraj Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I get a link to where that dependency is? Is it the sdk dependency here at line 41 in toml file "opentelemetry-sdk==1.38.0" ?

@amitvikramraj amitvikramraj force-pushed the feat/otel branch 2 times, most recently from 087c7a7 to dced3e4 Compare April 19, 2026 05:49
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 19, 2026

🔍 Broken Links Report

Summary

  • 📁 Files with broken links: 2
  • 🔗 Total broken links: 4
  • 📄 Broken markdown links: 4
  • 🖼️ Broken image links: 0
  • ⚠️ Broken reference placeholders: 0

Details

File Link Type Link Text Broken Path
zenml-pro/resource-pools-core-concepts.md 📄 "How preemption works" resource-pools-preemption.md
zenml-pro/resource-pools-examples.md 📄 "How preemption works" resource-pools-preemption.md
zenml-pro/resource-pools-examples.md 📄 "How preemption works" resource-pools-preemption.md
zenml-pro/resource-pools-examples.md 📄 "How preemption works" resource-pools-preemption.md
📂 Full file paths
  • /home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/resource-pools-core-concepts.md
  • /home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/resource-pools-examples.md
  • /home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/resource-pools-examples.md
  • /home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/resource-pools-examples.md

@amitvikramraj amitvikramraj changed the title Add structured logs to zenml-server using structlog feat: add structured logs to zenml-server and OTel instrumentation Apr 19, 2026
… logs appeared and added a function to sanitize log records for OTel compatibility in OTel Log Store
@amitvikramraj amitvikramraj added internal To filter out internal PRs and issues backend Issues that require changes on the backend core-team Issues that are being handled by the core team labels Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Issues that require changes on the backend core-team Issues that are being handled by the core team internal To filter out internal PRs and issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add structured log

3 participants