feat(encoder): add Qwen3-VL vision encoder modeling #729

sungwook-son wants to merge 2 commits into ai-dynamo:main from …
Conversation
Walkthrough: This pull request adds comprehensive Vision-Language (VL) model support by introducing a vision encoder infrastructure with configurable …

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)

⚠️ Review ran into problems: timed out fetching pipeline failures after 30000ms.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
collector/trtllm/collect_attn.py (1)
298-316: ⚠️ Potential issue | 🟡 Minor

Use `h` in the INT32 overflow guard to avoid under-collecting `head_dim=72` cases. At line 315, the guard still uses a fixed `128`, so the newly added `72` configs can be skipped unnecessarily.

Suggested fix

```diff
- if b * s * num_kv_heads * 128 * 2 >= 2147483647:
+ if b * s * num_kv_heads * h * 2 >= 2147483647:
      continue
```
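The guard's purpose is keeping buffer element counts under INT32; a minimal sketch of why the hard-coded 128 over-triggers for `head_dim=72` (variable names follow the collector's loops, and the buffer-size formula is the one quoted in the comment — the concrete sizes are illustrative):

```python
INT32_MAX = 2_147_483_647

def overflows_int32(b: int, s: int, num_kv_heads: int, h: int) -> bool:
    # Element count of a K/V buffer: batch * seq * kv_heads * head_dim * 2.
    return b * s * num_kv_heads * h * 2 >= INT32_MAX

# With the hard-coded 128, this head_dim=72 case is judged by the wrong size
# and skipped, even though its real element count fits in INT32.
b, s, num_kv_heads = 8, 131072, 8
print(overflows_int32(b, s, num_kv_heads, 128))  # True  -> case skipped
print(overflows_int32(b, s, num_kv_heads, 72))   # False -> case should run
```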
🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@collector/trtllm/collect_attn.py` around lines 298-316, the INT32 overflow check multiplies by a hard-coded 128, which ignores the loop-variable head size and causes configurations with head_dim=72 to be skipped; update the overflow guard in the nested loops (the block iterating head_dim, n, s, b, n_kv) to use the loop variable h instead of the constant 128 (i.e., b * s * num_kv_heads * h * 2 >= 2147483647) so all head_dim values are evaluated correctly.

src/aiconfigurator/webapp/events/event_fn.py (1)
230-236: ⚠️ Potential issue | 🟡 Minor

Keep the error message on the summary output. After the new encoder box was inserted, output #1 is the summary and output #2 is the encoder breakdown, but the error tuple still puts "ERROR!!!" in slot 2. On failures the summary box stays blank and the encoder box shows the error instead.

Suggested fix

```diff
  if is_error:
      return (
-         gr.update(value=""),
-         gr.update(value="ERROR!!!"),
+         gr.update(value="ERROR!!!"),
+         gr.update(value=""),
          gr.update(value=""),
          gr.update(value=""),
          gr.update(value=record_df),
          gr.update(value=stdout_text + stderr_text + traceback_log),
      )
```
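The intended slot ordering can be sketched without Gradio (the six-slot wiring — summary, encoder breakdown, two blanks, record table, logs — is read off the review comment; the function name is hypothetical):

```python
def build_error_outputs(record_df, stdout_text, stderr_text, traceback_log):
    # Slot 1 is the summary box, slot 2 the encoder breakdown; on error the
    # summary should carry the marker and the encoder box should stay blank.
    return (
        "ERROR!!!",  # summary
        "",          # encoder breakdown
        "",
        "",
        record_df,
        stdout_text + stderr_text + traceback_log,
    )

outputs = build_error_outputs([], "out", "err", "tb")
print(outputs[0], outputs[1] == "")  # ERROR!!! True
```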
🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/webapp/events/event_fn.py` around lines 230-236, the tuple of gr.update outputs puts the error message ("ERROR!!!") into the second output (encoder) rather than the first (summary); reorder the gr.update(...) entries in the return tuple so the summary output is populated with the error and the encoder breakdown output receives an empty string.

src/aiconfigurator/sdk/inference_summary.py (1)
248-266: ⚠️ Potential issue | 🟡 Minor

Docstring copy-paste error: says "context" but should say "encoder". Both `set_encoder_power_avg` and `get_encoder_power_avg` have docstrings that incorrectly say "context phase average power" instead of "encoder phase average power".

📝 Proposed fix

```diff
  def set_encoder_power_avg(self, power_avg: float) -> None:
-     """Set context phase average power (watts)."""
+     """Set encoder phase average power (watts)."""
      self._encoder_power_avg = power_avg

  # ... other methods ...

  def get_encoder_power_avg(self) -> float:
-     """Get context phase average power (watts)."""
+     """Get encoder phase average power (watts)."""
      return self._encoder_power_avg
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/inference_summary.py` around lines 248-266, the docstrings for set_encoder_power_avg and get_encoder_power_avg incorrectly say "context phase average power"; update both docstrings to read "encoder phase average power (watts)" so they match the methods' purpose.

src/aiconfigurator/sdk/backends/base_backend.py (1)
296-319: ⚠️ Potential issue | 🟠 Major

Account for image context tokens in the memory calculation for VL models. The context phase processes `isl + img_ctx_tokens` tokens, but `_get_memory_usage` is called with the original `isl`. Since activation memory scales with token count (activations = 2 * num_tokens * h * constants), the memory estimate understates the activation footprint during context processing. Update the memory calls in the `static_ctx`, `static_gen`, and `static` modes to pass the augmented sequence length (or adjust the `num_tokens` parameter explicitly).
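The size of the undercount follows from the linear scaling quoted above; a rough illustration (token counts, hidden size, and byte width are made-up values, not the SDK's):

```python
def activation_bytes(num_tokens: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    # Activation footprint grows linearly with the tokens processed in the
    # context phase (formula per the review comment; constants illustrative).
    return 2 * num_tokens * hidden_size * bytes_per_elem

isl, img_ctx_tokens, h = 4096, 1024, 5120
estimated = activation_bytes(isl, h)                # what the code computes today
actual = activation_bytes(isl + img_ctx_tokens, h)  # what it should compute
print(actual - estimated)  # 20971520 bytes missing from the estimate
```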
🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/backends/base_backend.py` around lines 296-319, the memory calculation underestimates activation usage for vision-language models because _get_memory_usage is called with isl instead of the augmented token count; update the calls in the "static_ctx", "static_gen", and "static" branches so _get_memory_usage receives isl + img_ctx_tokens (or explicitly set num_tokens to include img_ctx_tokens), keeping any prefix/osl arguments unchanged; reference _get_memory_usage and _run_encoder (img_ctx_tokens) to locate and modify the calls.
🧹 Nitpick comments (8)
collector/vllm/collect_attn.py (1)
517-521: Move the SM100 GQA-ratio validation to runtime error handling instead of preventive test-case filtering. The `continue` statement at line 520 pre-filters test cases during generation, reducing coverage and masking potential compatibility changes. The collector framework is designed to catch version-incompatible test cases at runtime (non-fatally) rather than preventively gate them during test-case construction. Let the kernel fail at runtime and be caught by the collection error handler.
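The runtime-catch pattern the reviewer describes can be sketched generically (`collect`, `run_kernel`, and the case tuples are hypothetical stand-ins, not the collector's real API):

```python
def collect(cases, run_kernel):
    # Run every generated case; record failures non-fatally instead of
    # pre-filtering cases (e.g., by GQA ratio) at generation time.
    results, skipped = [], []
    for case in cases:
        try:
            results.append((case, run_kernel(case)))
        except Exception as exc:
            skipped.append((case, repr(exc)))
    return results, skipped

def fake_kernel(case):
    n, n_kv = case
    if n // n_kv > 16:  # stand-in for an unsupported-config runtime error
        raise ValueError("unsupported GQA ratio")
    return n * n_kv

ok, bad = collect([(32, 4), (64, 2)], fake_kernel)
print(len(ok), len(bad))  # 1 1
```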
🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@collector/vllm/collect_attn.py` around lines 517-521, remove the preventive filter that uses get_sm_version() and the n // n_kv > 16 check to skip test cases on SM100; generate the cases regardless and let trtllm_batch_decode_with_kv_cache failures be caught non-fatally by the collector's error handler at collection time.

src/aiconfigurator/sdk/task.py (1)
1334-1335: Drop the placeholder encoder locals until they carry real values. These variables are always `None` and only forwarded, which adds noise in `run_disagg`. Prefer passing real config-derived values or omitting these kwargs for now.

Also applies to: 1369-1370
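One conventional way to "omit the kwargs unless set" is to build the kwargs dict conditionally (names here are illustrative, not the SDK's real signatures):

```python
def run_disagg(task, **kwargs):
    # Stand-in for the real entry point; it only ever sees encoder kwargs
    # that carry meaningful values.
    return sorted(kwargs)

encoder_database = None      # placeholder today
encoder_backend_name = None  # placeholder today

extra = {}
if encoder_database is not None:
    extra["encoder_database"] = encoder_database
if encoder_backend_name is not None:
    extra["encoder_backend_name"] = encoder_backend_name

print(run_disagg("task", **extra))  # [] -> no noisy None kwargs forwarded
```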
🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/task.py` around lines 1334-1335, remove the always-None placeholder locals encoder_database and encoder_backend_name, or derive real config values for them; at call sites, include the kwargs only when they are not None (or drop them entirely) so run_disagg only receives meaningful encoder-related arguments.

src/aiconfigurator/sdk/backends/trtllm_backend.py (1)
592-596: Encoder weights are added after PP normalization. Line 592 normalizes to per-GPU weight memory, but lines 594-595 add unnormalized encoder weights. This can inflate memory estimates for pipeline-parallel setups and bias OOM checks.

♻️ Proposed fix

```diff
- # count weights on a single GPU
- weights /= model.config.pp_size
-
- for op in model.encoder_ops:
-     weights += op.get_weights()
+ for op in model.encoder_ops:
+     weights += op.get_weights()
+
+ # count weights on a single GPU
+ weights /= model.config.pp_size
```
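The inflation is purely an ordering effect, assuming the review's premise that encoder weights should also be split across pipeline stages; a toy sketch with made-up sizes:

```python
llm_weights = 64.0      # GiB of text-backbone weights (illustrative)
encoder_weights = 8.0   # GiB of vision-encoder weights (illustrative)
pp_size = 4

# Current order: normalize first, then add full encoder weights.
wrong = llm_weights / pp_size + encoder_weights       # 24.0 GiB per GPU

# Proposed order: add encoder weights, then normalize by pp_size.
right = (llm_weights + encoder_weights) / pp_size     # 18.0 GiB per GPU
print(wrong, right)  # 24.0 18.0
```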
🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/backends/trtllm_backend.py` around lines 592-596, the code divides weights by model.config.pp_size and then adds unnormalized encoder weights from model.encoder_ops; normalize encoder weights by the same pipeline-parallel factor (add them before dividing, or divide each op.get_weights() by pp_size) so the final value reflects per-GPU memory correctly.

src/aiconfigurator/sdk/pareto_analysis.py (1)
216-229: Document the new encoder kwargs in the `disagg_pareto` docstring. `encoder_database` and `encoder_backend_name` are now accepted and used, but they're missing from the `**kwargs` contract. Adding them will make this API easier to use correctly.

Also applies to: 255-262
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/pareto_analysis.py` around lines 216-229, the disagg_pareto docstring's kwargs contract is missing the new encoder kwargs; document encoder_database and encoder_backend_name (expected types and a brief note on their effect) in both occurrences — the block around lines 216-229 and the repeated block around lines 255-262 — consistently.

src/aiconfigurator/sdk/backends/sglang_backend.py (1)
522-526: Scale encoder weights consistently in the per-GPU memory calculation. On line 522, weights are converted to per-GPU (`/ pp_size`), but lines 524-525 add full encoder weights afterward. For `pp_size > 1`, this can overestimate memory and incorrectly trigger OOM filtering.

♻️ Proposed fix

```diff
- # Count weights on a single GPU
- weights /= model.config.pp_size
-
- for op in model.encoder_ops:
-     weights += op.get_weights()
+ for op in model.encoder_ops:
+     weights += op.get_weights()
+
+ # Count weights on a single GPU
+ weights /= model.config.pp_size
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/backends/sglang_backend.py` around lines 522-526, the per-GPU scaling is inconsistent: weights are divided by model.config.pp_size but full encoder weights from model.encoder_ops are added afterward, which overcounts when pp_size > 1; apply the same per-GPU scaling to encoder weights (divide each op.get_weights() by pp_size, or sum encoder weights first and then divide) so all components are scaled consistently.

tests/unit/sdk/test_utils.py (1)
663-699: Add a regression assert for `deepstack_visual_indexes`. The new parser branch preserves this field on `VisionEncoderConfig`, but the fixture omits it and none of the new tests verify tuple conversion. A regression there would still pass even though the VL encoder wiring depends on it.

Suggested test addition

```diff
  _QWEN3VL_HF_CONFIG = {
      "architectures": [_QWEN3VL_ARCH],
      @@
      "vision_config": {
+         "deepstack_visual_indexes": [8, 16, 24],
          "depth": 27,
          "hidden_size": 1152,
          "num_heads": 16,
          "intermediate_size": 4304,
          "patch_size": 16,
          "spatial_merge_size": 2,
          "temporal_patch_size": 2,
          "out_hidden_size": 5120,
      },
  }
  @@
  class TestQwen3VLVisionEncoderParsing:
      @@
      def test_vision_config_not_lost_after_text_config_unwrap(self):
          """Regression: vision_config must be captured before the text_config overwrite."""
          result = _parse_hf_config_json(_QWEN3VL_HF_CONFIG)
          assert result["extra_params"] is not None
+
+     def test_vision_encoder_deepstack_visual_indexes(self):
+         result = _parse_hf_config_json(_QWEN3VL_HF_CONFIG)
+         assert result["extra_params"].deepstack_visual_indexes == (8, 16, 24)
```

Also applies to: 743-793
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/sdk/test_utils.py` around lines 663-699, add a regression assert that, after parsing _QWEN3VL_HF_CONFIG into a VisionEncoderConfig, the deepstack_visual_indexes field exists, is a tuple, and contains the expected values, so tuple conversion is enforced.

tests/unit/sdk/models/test_model_config.py (1)
403-413: Consider adding a test for the `encoder_proj_to_llm_gemm` op. The PR summary mentions the projection to LLM as a key encoder component, but this op name isn't asserted in `test_encoder_op_names`. If this op exists, adding it would improve coverage.

💡 Suggested addition

```diff
  def test_encoder_op_names(self, vl_model):
      """All expected encoder op names must be present."""
      names = [op._name for op in vl_model.encoder_ops]
      assert "encoder_qkv_gemm" in names
      assert "encoder_attention" in names
      assert "encoder_proj_gemm" in names
      assert "encoder_ffn1_gemm" in names
      assert "encoder_ffn2_gemm" in names
      assert "encoder_merger_fc1" in names
      assert "encoder_merger_act" in names
      assert "encoder_merger_fc2" in names
+     assert "encoder_proj_to_llm_gemm" in names  # projection from ViT to LLM hidden dim
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/sdk/models/test_model_config.py` around lines 403-413, add an assertion in test_encoder_op_names that "encoder_proj_to_llm_gemm" is present in the list built from vl_model.encoder_ops, so the projection-to-LLM op is covered.

tests/unit/sdk/backends/test_encoder.py (1)
207-217: Add at least one `tp_size > 1` encoder instantiation here. All current model-level coverage uses the default config, so it never exercises the new TP-sharded encoder path. A small VL case with `tp_size > 1` would catch regressions like invalid local ViT head counts or missing encoder collectives much earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/sdk/backends/test_encoder.py` around lines 207 - 217, Add a new test that exercises the TP-sharded encoder path by creating a ModelConfig with tp_size > 1 (e.g., set model_config.tp_size = 2 or instantiate a new config with tp_size=2), then call get_model for a visual-LM model (e.g., "Qwen/Qwen3-VL-32B-Instruct") using that config and backend "trtllm", and assert encoder_ops is non-empty; reference the existing test helpers and symbols get_model and model.encoder_ops so the test runs alongside test_vl_model_has_non_empty_encoder_ops and will catch TP-sharded encoder regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/aiconfigurator/cli/report_and_save.py`:
- Around line 175-184: The code fails on Pandas NaN for encoder fields and omits
encoder contribution from the replica GPU math; update the parsing of encoder
fields (e_tp, e_pp and any encoder worker/replica counts like "(e)workers" and
"(e)dp") to safely coerce NaN to 1 (e.g., use pandas.isna(row.get(...)) or a
small helper that returns int(value) if valid else 1) before computing e_gpus
and e_gpus_str, and include the encoder term in gpus_replica_str arithmetic and
string composition (add the encoder part = (e)workers x ((e)pp * (e)tp * (e)dp)
and reflect it in the total GPU count) so replica composition and total_gpus
account for encoders consistently with the existing p/d terms.
In `@src/aiconfigurator/sdk/models.py`:
- Around line 938-945: The vision encoder TP split isn't validated before doing
integer division (e.g., computing head_size_vit = h_vit // n_vit and other
usages of // tp_size), so TP sizes that don't evenly divide vision dims can drop
channels or yield zero; add the same divisibility checks used for the text
backbone to validate tp_size against the vision encoder dimensions (h_vit,
n_vit, inter_vit and any merger/FFN widths derived from h_vit and h_llm) before
performing floor-division, and raise a clear exception or assert (with context
mentioning tp_size and the offending dimension) if the split is invalid so the
code doesn't silently continue with truncated/zero sizes.
- Around line 951-1025: The encoder GEMMs ("encoder_proj_gemm",
"encoder_ffn2_gemm", "encoder_merger_fc2") are TP-sharded but lack the
corresponding TP collectives; add matching all-reduce (or the same collective
primitive used elsewhere for row-parallel GEMMs) immediately after each of these
ops when tp_size > 1 so the encoder path mirrors the LLM row-parallel pattern
and yields correct TTFT/energy estimates; ensure you use the same collective op
type and semantics (datatype/scale/axis) as used for other GEMM collectives in
the file and honor low_precision_input for "encoder_proj_gemm" and
"encoder_ffn2_gemm" when constructing the collective.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ce9b18c5-a8c1-4781-b439-ac3ccd08afc5
📒 Files selected for processing (31)
collector/sglang/collect_attn.py
collector/trtllm/collect_attn.py
collector/vllm/collect_attn.py
src/aiconfigurator/cli/report_and_save.py
src/aiconfigurator/model_configs/Qwen--Qwen3-VL-235B-A22B-Instruct_config.json
src/aiconfigurator/model_configs/Qwen--Qwen3-VL-2B-Instruct_config.json
src/aiconfigurator/model_configs/Qwen--Qwen3-VL-30B-A3B-Instruct_config.json
src/aiconfigurator/model_configs/Qwen--Qwen3-VL-32B-Instruct_config.json
src/aiconfigurator/model_configs/Qwen--Qwen3-VL-32B-Thinking_config.json
src/aiconfigurator/model_configs/Qwen--Qwen3-VL-4B-Instruct_config.json
src/aiconfigurator/model_configs/Qwen--Qwen3-VL-8B-Instruct_config.json
src/aiconfigurator/sdk/backends/base_backend.py
src/aiconfigurator/sdk/backends/sglang_backend.py
src/aiconfigurator/sdk/backends/trtllm_backend.py
src/aiconfigurator/sdk/common.py
src/aiconfigurator/sdk/config.py
src/aiconfigurator/sdk/inference_session.py
src/aiconfigurator/sdk/inference_summary.py
src/aiconfigurator/sdk/models.py
src/aiconfigurator/sdk/pareto_analysis.py
src/aiconfigurator/sdk/picking.py
src/aiconfigurator/sdk/task.py
src/aiconfigurator/sdk/utils.py
src/aiconfigurator/systems/data/a100_sxm/vllm/0.14.0/context_attention_perf.txt
src/aiconfigurator/systems/data/a100_sxm/vllm/0.14.0/generation_attention_perf.txt
src/aiconfigurator/webapp/events/event_fn.py
src/aiconfigurator/webapp/events/event_handler.py
tests/unit/sdk/backends/test_encoder.py
tests/unit/sdk/models/test_model_config.py
tests/unit/sdk/test_common.py
tests/unit/sdk/test_utils.py
```python
e_tp = int(row.get("(e)tp", 1) or 1)
e_pp = int(row.get("(e)pp", 1) or 1)
e_gpus = e_tp * e_pp
e_gpus_str = f"{e_gpus} (=\033[4m{e_tp}\033[0mx\033[4m{e_pp}\033[0m)"
e_parallel_str = f"tp\033[4m{e_tp}\033[0mpp\033[4m{e_pp}\033[0m"
gpus_replica_str = (
    f"{row['num_total_gpus']} "
    f"(={row['(p)workers']}x{row['(p)pp'] * row['(p)tp'] * row['(p)dp']}"
    f"+{row['(d)workers']}x{row['(d)pp'] * row['(d)tp'] * row['(d)dp']})"
)
```
Handle missing encoder fields safely and include them in the replica GPU math.
These encoder columns are optional, but int(row.get("(e)tp", 1) or 1) will still blow up on Pandas NaN, and the rendered gpus/replica string still ignores encoder workers entirely. That makes the new disagg summary unreliable for non-VL rows and undercounts replica composition for VL rows.
Suggested fix
```diff
- e_tp = int(row.get("(e)tp", 1) or 1)
- e_pp = int(row.get("(e)pp", 1) or 1)
+ e_workers = 0 if pd.isna(row.get("(e)workers", 0)) else int(row.get("(e)workers", 0))
+ e_tp = 1 if pd.isna(row.get("(e)tp", 1)) else int(row.get("(e)tp", 1))
+ e_pp = 1 if pd.isna(row.get("(e)pp", 1)) else int(row.get("(e)pp", 1))
  e_gpus = e_tp * e_pp
- e_gpus_str = f"{e_gpus} (=\033[4m{e_tp}\033[0mx\033[4m{e_pp}\033[0m)"
- e_parallel_str = f"tp\033[4m{e_tp}\033[0mpp\033[4m{e_pp}\033[0m"
- gpus_replica_str = (
-     f"{row['num_total_gpus']} "
-     f"(={row['(p)workers']}x{row['(p)pp'] * row['(p)tp'] * row['(p)dp']}"
-     f"+{row['(d)workers']}x{row['(d)pp'] * row['(d)tp'] * row['(d)dp']})"
- )
+ replica_terms = [
+     f"{row['(p)workers']}x{row['(p)pp'] * row['(p)tp'] * row['(p)dp']}",
+     f"{row['(d)workers']}x{row['(d)pp'] * row['(d)tp'] * row['(d)dp']}",
+ ]
+ if e_workers:
+     replica_terms.append(f"{e_workers}x{e_gpus}")
+ gpus_replica_str = f"{row['num_total_gpus']} (={' + '.join(replica_terms)})"
```
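The `or 1` idiom in the original code does not guard against NaN, because NaN is truthy; a small standalone sketch of the hazard and a coercion helper (the helper is illustrative — the suggested fix uses `pd.isna` instead):

```python
import math

def safe_int(value, default=1):
    # Coerce missing/NaN cells to a default before integer conversion.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    return int(value)

nan = float("nan")
# int(nan or 1) raises: NaN is truthy, so `or 1` never supplies the default.
try:
    int(nan or 1)
except ValueError as exc:
    print("unguarded:", exc)

print(safe_int(nan))   # 1
print(safe_int(4.0))   # 4
```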
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/aiconfigurator/cli/report_and_save.py` around lines 175 - 184, The code
fails on Pandas NaN for encoder fields and omits encoder contribution from the
replica GPU math; update the parsing of encoder fields (e_tp, e_pp and any
encoder worker/replica counts like "(e)workers" and "(e)dp") to safely coerce
NaN to 1 (e.g., use pandas.isna(row.get(...)) or a small helper that returns
int(value) if valid else 1) before computing e_gpus and e_gpus_str, and include
the encoder term in gpus_replica_str arithmetic and string composition (add the
encoder part = (e)workers x ((e)pp * (e)tp * (e)dp) and reflect it in the total
GPU count) so replica composition and total_gpus account for encoders
consistently with the existing p/d terms.
```python
tp_size = self.config.tp_size
depth = encoder_config.depth
h_vit = encoder_config.hidden_size
n_vit = encoder_config.num_heads
inter_vit = encoder_config.intermediate_size
h_llm = encoder_config.out_hidden_size
head_size_vit = h_vit // n_vit  # 1152 // 16 = 72
```
|
Validate the vision encoder TP split before floor-dividing it.
The new encoder uses `// tp_size` for ViT heads, FFN width, and merger width, but only the text backbone is validated for TP divisibility. A TP size can be valid for the LLM and still invalid for the vision encoder, silently dropping channels or even producing 0 local ViT heads.
Suggested fix
```diff
 tp_size = self.config.tp_size
 depth = encoder_config.depth
 h_vit = encoder_config.hidden_size
 n_vit = encoder_config.num_heads
 inter_vit = encoder_config.intermediate_size
 h_llm = encoder_config.out_hidden_size
 head_size_vit = h_vit // n_vit
+merger_dim = h_vit * encoder_config.spatial_merge_size ** 2
+
+if n_vit % tp_size != 0:
+    raise ValueError(f"vision num_heads ({n_vit}) must be divisible by tp_size ({tp_size})")
+if inter_vit % tp_size != 0:
+    raise ValueError(
+        f"vision intermediate_size ({inter_vit}) must be divisible by tp_size ({tp_size})"
+    )
+if merger_dim % tp_size != 0:
+    raise ValueError(
+        f"vision merger dim ({merger_dim}) must be divisible by tp_size ({tp_size})"
+    )
```

Also applies to: 979-1023, 1368-1378, 1408-1449
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/aiconfigurator/sdk/models.py` around lines 938 - 945, The vision encoder
TP split isn't validated before doing integer division (e.g., computing
head_size_vit = h_vit // n_vit and other usages of // tp_size), so TP sizes that
don't evenly divide vision dims can drop channels or yield zero; add the same
divisibility checks used for the text backbone to validate tp_size against the
vision encoder dimensions (h_vit, n_vit, inter_vit and any merger/FFN widths
derived from h_vit and h_llm) before performing floor-division, and raise a
clear exception or assert (with context mentioning tp_size and the offending
dimension) if the split is invalid so the code doesn't silently continue with
truncated/zero sizes.
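To see why the guard matters: floor division silently shrinks the sharded dimension whenever `tp_size` does not divide it. A quick illustration using the 16 ViT heads quoted in the comment above (the `tp_size=3` value is chosen here purely to show the failure mode):

```python
n_vit, tp_size = 16, 3            # 3 does not divide 16

local_heads = n_vit // tp_size    # 5 heads per rank after floor division
covered = local_heads * tp_size   # only 15 of the 16 heads are actually computed
print(n_vit - covered)            # 1 head silently dropped

def check_divisible(name, dim, tp):
    # the guard the review asks for: fail loudly instead of truncating
    if dim % tp != 0:
        raise ValueError(f"{name} ({dim}) must be divisible by tp_size ({tp})")

try:
    check_divisible("vision num_heads", n_vit, tp_size)
except ValueError as err:
    print(err)  # vision num_heads (16) must be divisible by tp_size (3)
```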
```python
self.encoder_ops.extend(
    [
        ops.ElementWise("encoder_add_norm_1", depth, 2 * h_vit, 2 * h_vit, 0.8),
        ops.GEMM(
            "encoder_qkv_gemm",
            depth,
            3 * n_vit * head_size_vit // tp_size,
            h_vit,
            vit_gemm_mode,
        ),
        ops.ContextAttention(
            "encoder_attention",
            depth,
            n_vit // tp_size,
            n_vit // tp_size,  # ViT has no GQA: n_kv == n
            vit_kvcache_mode,
            vit_fmha_mode,
            head_size=head_size_vit,
        ),
        ops.GEMM(
            "encoder_proj_gemm",
            depth,
            h_vit,
            n_vit * head_size_vit // tp_size,
            vit_gemm_mode,
            low_precision_input=True,
        ),
        ops.ElementWise("encoder_add_norm_2", depth, 2 * h_vit, 2 * h_vit, 0.8),
        ops.GEMM(
            "encoder_ffn1_gemm",
            depth,
            inter_vit // tp_size,
            h_vit,
            vit_gemm_mode,
        ),
        ops.ElementWise(
            "encoder_act",
            depth,
            inter_vit // tp_size,
            inter_vit // tp_size,
            0.8,
        ),
        ops.GEMM(
            "encoder_ffn2_gemm",
            depth,
            h_vit,
            inter_vit // tp_size,
            vit_gemm_mode,
            low_precision_input=True,
        ),
        # PatchMerger MLP: runs n_mergers times (1 final + deepstack instances)
        # Each merger: Linear(4*h_vit→4*h_vit) + GELU + Linear(4*h_vit→h_llm)
        # Operates on post-merge tokens (spatial_merge_size² patches fused per token)
        ops.GEMM(
            "encoder_merger_fc1",
            1 + len(encoder_config.deepstack_visual_indexes),
            (h_vit * encoder_config.spatial_merge_size ** 2) // tp_size,
            h_vit * encoder_config.spatial_merge_size ** 2,
            vit_gemm_mode,
        ),
        ops.ElementWise(
            "encoder_merger_act",
            1 + len(encoder_config.deepstack_visual_indexes),
            (h_vit * encoder_config.spatial_merge_size ** 2) // tp_size,
            (h_vit * encoder_config.spatial_merge_size ** 2) // tp_size,
            0.8,
        ),
        ops.GEMM(
            "encoder_merger_fc2",
            1 + len(encoder_config.deepstack_visual_indexes),
            h_llm,
            (h_vit * encoder_config.spatial_merge_size ** 2) // tp_size,
            vit_gemm_mode,
        ),
    ]
)
```
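For concreteness, the merger shapes implied by those ops can be checked with the Qwen3-VL-style numbers quoted earlier in the review (h_vit=1152, spatial_merge_size=2); the `h_llm` and `deepstack_visual_indexes` values below are illustrative placeholders, not read from this PR:

```python
h_vit, spatial_merge_size = 1152, 2        # values quoted in the review comment
h_llm = 2048                               # illustrative out_hidden_size
deepstack_visual_indexes = [5, 11, 17]     # illustrative deepstack layer taps

merger_in = h_vit * spatial_merge_size ** 2     # 4 * h_vit = 4608: 2x2 patches per token
n_mergers = 1 + len(deepstack_visual_indexes)   # 1 final merger + one per deepstack tap

# Per merger: Linear(merger_in -> merger_in) + GELU + Linear(merger_in -> h_llm)
params_per_merger = merger_in * merger_in + merger_in * h_llm
print(merger_in, n_mergers, params_per_merger)  # 4608 4 30670848
```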
Model the encoder TP collectives too.
These encoder GEMMs are TP-sharded, but unlike the LLM paths there is no matching all-reduce after `encoder_proj_gemm`, `encoder_ffn2_gemm`, or `encoder_merger_fc2`. For `tp_size > 1` that will systematically under-estimate encoder TTFT/energy and make the encoder path inconsistent with the row-parallel pattern used everywhere else.
Suggested fix
```diff
         ops.GEMM(
             "encoder_proj_gemm",
             depth,
             h_vit,
             n_vit * head_size_vit // tp_size,
             vit_gemm_mode,
             low_precision_input=True,
         ),
+        ops.CustomAllReduce("encoder_ar_1", depth, h_vit, tp_size),
         ops.ElementWise("encoder_add_norm_2", depth, 2 * h_vit, 2 * h_vit, 0.8),
         ops.GEMM(
             "encoder_ffn1_gemm",
             depth,
             inter_vit // tp_size,
@@
         ops.GEMM(
             "encoder_ffn2_gemm",
             depth,
             h_vit,
             inter_vit // tp_size,
             vit_gemm_mode,
             low_precision_input=True,
         ),
+        ops.CustomAllReduce("encoder_ar_2", depth, h_vit, tp_size),
@@
         ops.GEMM(
             "encoder_merger_fc2",
             1 + len(encoder_config.deepstack_visual_indexes),
             h_llm,
             (h_vit * encoder_config.spatial_merge_size ** 2) // tp_size,
             vit_gemm_mode,
         ),
+        ops.CustomAllReduce(
+            "encoder_merger_ar",
+            1 + len(encoder_config.deepstack_visual_indexes),
+            h_llm,
+            tp_size,
+        ),
```

Also applies to: 1380-1452
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/aiconfigurator/sdk/models.py` around lines 951 - 1025, The encoder GEMMs
("encoder_proj_gemm", "encoder_ffn2_gemm", "encoder_merger_fc2") are TP-sharded
but lack the corresponding TP collectives; add matching all-reduce (or the same
collective primitive used elsewhere for row-parallel GEMMs) immediately after
each of these ops when tp_size > 1 so the encoder path mirrors the LLM
row-parallel pattern and yields correct TTFT/energy estimates; ensure you use
the same collective op type and semantics (datatype/scale/axis) as used for
other GEMM collectives in the file and honor low_precision_input for
"encoder_proj_gemm" and "encoder_ffn2_gemm" when constructing the collective.
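The row-parallel pattern this comment refers to can be sketched numerically: each rank holds a column slice of the activations and the matching row slice of the weight, its local GEMM produces a full-width partial result, and the all-reduce sums those partials. A toy NumPy check of that identity (not the simulator's op classes; all sizes here are arbitrary small values):

```python
import numpy as np

rng = np.random.default_rng(0)
tp_size, tokens, k, n = 4, 8, 32, 16       # k is the sharded reduction dimension

x = rng.standard_normal((tokens, k))
w = rng.standard_normal((k, n))

# Row-parallel GEMM: rank r owns rows [r*shard, (r+1)*shard) of W and the
# matching columns of X; each rank's local GEMM yields a (tokens, n) partial sum.
shard = k // tp_size
partials = [
    x[:, r * shard:(r + 1) * shard] @ w[r * shard:(r + 1) * shard, :]
    for r in range(tp_size)
]
out = sum(partials)                        # this summation is the all-reduce step

assert np.allclose(out, x @ w)             # matches the unsharded GEMM
```

Without the reduce, each rank holds only its partial sum, which is why a cost model that shards the GEMMs but omits the collective under-counts the critical path.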
force-pushed from 98011da to 1937c88
force-pushed from 1937c88 to 7b220d0

Signed-off-by: Son, Sungwook <sungwook.son@intel.com>

force-pushed from 7b220d0 to e3aaaf7

- Replace ambiguous multiplication sign (×) with x in docstrings
- Break long lines >120 chars in base_backend.py and inference_summary.py
- Remove trailing whitespace in models.py and inference_summary.py
- Auto-format pareto_analysis.py function call arguments

Signed-off-by: Son, Sungwook <sungwook.son@intel.com>

force-pushed from 8405ddf to a519a6e
Overview:
Adds end-to-end latency estimation for the Qwen3-VL family of vision-language models. This includes the full ViT encoder pipeline (attention, FFN, PatchMerger MLP, and the projection to the LLM) and correctly feeds post-merge image tokens into the LLM backbone ISL, so that TTFT accounts for both encoder and prefill compute.
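The token accounting described above can be sketched as a small function. The patch size (16) and merge factor (2) below follow common Qwen-VL conventions but are assumptions for illustration, not values read from this PR:

```python
def image_tokens(height, width, patch=16, spatial_merge_size=2):
    """Post-merge vision tokens an image contributes to the backbone ISL."""
    grid_h, grid_w = height // patch, width // patch       # raw patch grid
    return (grid_h * grid_w) // (spatial_merge_size ** 2)  # 2x2 patches fuse per token

text_isl = 1024
vision_tokens = image_tokens(1024, 1024)   # 64*64 patches -> 1024 merged tokens
effective_isl = text_isl + vision_tokens   # ISL the LLM prefill actually sees
print(vision_tokens, effective_isl)        # 1024 2048
```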
Details:
integration (55 tests total, all passing).
Where should the reviewer start?
Summary by CodeRabbit
Release Notes
New Features
Improvements
Tests