Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main    #3185      +/-   ##
==========================================
- Coverage   81.09%   81.08%   -0.01%
==========================================
  Files         318      318
  Lines       73398    73398
==========================================
- Hits        59519    59513       -6
- Misses      13323    13329       +6
  Partials      556      556
|
| - Receives Request. | ||
| - Extracts `T1-S3`. | ||
| - Starts Span `S4` (Parent: `S3`). runs user code... Ends `S4`. | ||
| - **Crucial**: SDK sends Span `S4` data to Collector. |
Are we expecting the SDK to handle sending the data?
Yes, the idea is for both the core and the SDK to send data.
Trace X:
Core Span: Vertex Processing
SDK will start Span for: UDF Execution
User starts Span (optional): some calculation
For SDK/user spans to work, the user has to initialize tracing. We can provide a helper function in the SDK that sets up a global tracer for the UDF process using the same OTEL endpoint environment variable.
| ## 2. Key Architectural Decisions | ||
|
|
||
| ### 2.1 Format: W3C Trace Context | ||
| We will use the **W3C Trace Context** standard (`traceparent`, `tracestate`) for context propagation. |
| - **Key**: `"tracing"` | ||
| - **Value**: A `KeyValueGroup` containing the W3C headers. | ||
|
|
||
| **Message Structure:** |
We do not need to add anything to our proto spec, right? This is already in place.
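For reference, the `traceparent` value that section 2.1 proposes to carry under the `"tracing"` key has the fixed W3C layout `version-traceid-parentid-flags`. The following is a minimal, stdlib-only sketch of parsing and re-serializing that value; the `TraceParent` type and its method names are hypothetical and not part of Numaflow or any OpenTelemetry crate, and only version `00` is handled.

```rust
/// Hypothetical holder for a parsed W3C `traceparent` value (version 00 only).
#[derive(Debug, Clone, PartialEq, Eq)]
struct TraceParent {
    trace_id: String, // 32 lowercase hex characters
    span_id: String,  // 16 lowercase hex characters (the "parent-id" field)
    flags: u8,        // bit 0 is the "sampled" flag
}

/// True if `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

impl TraceParent {
    /// Parses `00-<trace-id>-<parent-id>-<flags>`; returns None on any violation.
    fn parse(header: &str) -> Option<TraceParent> {
        let parts: Vec<&str> = header.trim().split('-').collect();
        if parts.len() != 4 || parts[0] != "00" {
            return None;
        }
        let (trace_id, span_id, flags) = (parts[1], parts[2], parts[3]);
        if !is_lower_hex(trace_id, 32) || !is_lower_hex(span_id, 16) || !is_lower_hex(flags, 2) {
            return None;
        }
        // The spec treats all-zero trace and parent IDs as invalid.
        if trace_id.bytes().all(|b| b == b'0') || span_id.bytes().all(|b| b == b'0') {
            return None;
        }
        Some(TraceParent {
            trace_id: trace_id.to_string(),
            span_id: span_id.to_string(),
            flags: u8::from_str_radix(flags, 16).ok()?,
        })
    }

    /// Serializes back to the header form.
    fn to_header(&self) -> String {
        format!("00-{}-{}-{:02x}", self.trace_id, self.span_id, self.flags)
    }

    /// Whether the upstream caller marked the trace as sampled.
    fn is_sampled(&self) -> bool {
        (self.flags & 0x01) == 0x01
    }
}
```

A vertex that reads a message would call `TraceParent::parse` on the stored string and fall back to starting a new trace when it returns `None`.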
| - **Implementation**: The Core injects the **same** current span context (parent) into `sys_metadata` for **every** output message. | ||
| - **Result**: Downstream vertices will create N separate spans, all pointing back to the same parent span from the Map vertex. | ||
|
|
||
| ## 4. Implementation Plan: Numaflow SDKs (e.g., Go/Rust/Python) |
Let's not worry about SDKs; we just need server spans. Adding dependencies to the SDKs might be overkill.
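The fan-out rule quoted in the hunk above is Core-side behavior, so it applies even if the SDKs stay untouched. As a rough sketch of that rule, the snippet below stamps the same parent context onto every output message. The `Message` struct and the `HashMap`-based `KeyValueGroup` alias are simplified stand-ins assumed for illustration, not the actual Numaflow types.

```rust
use std::collections::HashMap;

/// Stand-in for the proto `KeyValueGroup`; assumed to be a flat string map here.
type KeyValueGroup = HashMap<String, String>;

/// Simplified, hypothetical message shape with system metadata.
#[derive(Debug, Clone, Default)]
struct Message {
    payload: Vec<u8>,
    sys_metadata: HashMap<String, KeyValueGroup>,
}

/// Injects the same parent `traceparent` into every output of a 1 -> N map call,
/// so each downstream span points back to the one Map-vertex span.
fn inject_parent(outputs: &mut [Message], traceparent: &str) {
    for msg in outputs.iter_mut() {
        let group = msg.sys_metadata.entry("tracing".to_string()).or_default();
        group.insert("traceparent".to_string(), traceparent.to_string());
    }
}
```

Because every output carries an identical context, the downstream vertices produce N sibling spans under one parent rather than a chain.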
|
I applaud the objective of this proposal 👍🏼 Great synergy with #2645.
|
@adarsh0728 let's work on this PR itself, since there are a lot of eyes on it 😄
|
|
||
| 1. **Source**: | ||
| - Generates data. | ||
| - Starts Trace `T1`, Span `S1`. |
Do we create the trace ID in the SDK after receiving messages from the UDF? Only then can we propagate a traceparent coming from external systems like Kafka, right?
| - `Consumer`: Describes a child of an asynchronous request (e.g., receiving from a queue). | ||
| - `Internal`: Default. Represents an internal operation within an application. | ||
| 4. **Process**: | ||
| - **Before UDF**: Create a child span (e.g., `kind=Client`) for the UDF call. Inject this context into a *copy* of the metadata passed to the UDF. |
Similarly, should we set kind=Producer and kind=Consumer when writing to and reading from the ISB?
| 4. **Process**: | ||
| - **Before UDF**: Create a child span (e.g., `kind=Client`) for the UDF call. Inject this context into a *copy* of the metadata passed to the UDF. | ||
| - **After UDF**: Receive results. | ||
| 5. **Write**: |
I think handling traces in Sink should be a separate section.
We could add more details for retries, dropped messages, etc. For example:
// Record a Span Event with details
span.add_event("message.dropped", vec![
KeyValue::new("drop.reason", "fallback_strategy"),
KeyValue::new("drop.after_retries", 5),
]);
The tracing backend will see something like:
{
"trace_id": "4bf92f...",
"span_id": "aabb...",
"name": "numaflow.my-pipeline.sink-vertex.process",
"events": [
{
"name": "message.dropped",
"timestamp": "2026-04-01T12:00:00Z",
"attributes": {
"drop.reason": "fallback_strategy",
"drop.after_retries": 5
}
}
]
}
| - Extracts context `T1-S1`. | ||
| - Starts Span `S2` (Parent: `S1`) -> **"Vertex Processing"**. | ||
| - Covers overhead (ISB read/write, serialization). | ||
| - Prepares UDF Request. Starts Span `S3` (Parent: `S2`) -> **"UDF RPC Call"**. |
Let's use predictable span names, e.g.:
numaflow.{pipeline}.{vertex}.process
numaflow.{pipeline}.{vertex}.udf
numaflow.{pipeline}.{vertex}.sink.retry
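One way to keep these names consistent is to build them in a single place. The sketch below assumes a small helper; `SpanOp` and `span_name` are hypothetical names introduced only for illustration.

```rust
/// Hypothetical set of operations that get their own span name suffix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SpanOp {
    Process,
    Udf,
    SinkRetry,
}

impl SpanOp {
    /// Suffix appended after `numaflow.{pipeline}.{vertex}.`.
    fn suffix(self) -> &'static str {
        match self {
            SpanOp::Process => "process",
            SpanOp::Udf => "udf",
            SpanOp::SinkRetry => "sink.retry",
        }
    }
}

/// Builds a span name following the `numaflow.{pipeline}.{vertex}.{op}` pattern.
fn span_name(pipeline: &str, vertex: &str, op: SpanOp) -> String {
    format!("numaflow.{}.{}.{}", pipeline, vertex, op.suffix())
}
```

With this, `span_name("my-pipeline", "sink-vertex", SpanOp::Process)` yields the same name shape as the `"name"` field in the backend example earlier in this thread.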
| - **Limit**: To prevent span bloat in massive windows, limit the number of links (e.g., max 50). If more, sample representative links (e.g., first 10, last 10). | ||
|
|
||
| ### 3.6 Handling FlatMap (Fan-Out) | ||
| When a Map vertex produces multiple output messages from a single input (1 -> N): |
How does this work with batch map (N inputs to N outputs in one gRPC invocation)?
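On the link limit quoted at the top of this hunk (max 50 links, with first and last entries sampled beyond that), a minimal sketch of the sampling step could look like the following. The function name and parameters are assumptions; the values 50 and 10 come from the proposal text, and the same cap could plausibly apply to any N-input case if links are used there.

```rust
/// Returns the links to attach to a span.
/// If there are at most `max_links`, all are kept; otherwise only the first
/// `edge` and last `edge` entries are kept as representatives.
fn sample_links<T: Clone>(links: &[T], max_links: usize, edge: usize) -> Vec<T> {
    if links.len() <= max_links {
        return links.to_vec();
    }
    // Clamp so head and tail never overlap and never exceed the cap.
    let edge = edge.min(max_links / 2);
    let head = &links[..edge];
    let tail = &links[links.len() - edge..];
    head.iter().chain(tail.iter()).cloned().collect()
}
```

For a window of 100 input contexts, `sample_links(&contexts, 50, 10)` keeps 20 links: the first 10 and the last 10.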
|
I think we should use user metadata for trace propagation when the UDF is instrumented. If a UDF is instrumented, it should inject its active span context into outgoing user metadata (e.g. Order would be:
|
Signed-off-by: adarsh0728 <gooneriitk@gmail.com>
Proposal: OpenTelemetry Tracing Design for Numaflow