
fix(vision): resolve #23 color misidentification + address PR #22 review #35

Open
nickfinease wants to merge 11 commits into Kaden-Schutt:master from nickfinease:fix/vision-bundle-rebased

Conversation

@nickfinease

Summary

Supersedes #22. Rebased onto current upstream/master and addresses the review feedback there, plus lands the root-cause fix for issue #23 (VL model misidentifies green/blue objects).

Branch history (11 commits on top of upstream/master):

e9c50c1 fix(vision): swap G and B in preprocessing to fix color identification (#23)
4e637e8 feat(vision): log weight quant format at load time
872c5ba fix(vision): encode upload shape in elements, not bytes
6fcdddc cleanup(kernels): remove dead gemm_f16_wmma_tiled and vit_attention_flash
5f1681c fix(vision): implement smart_resize matching HuggingFace Qwen2VLImageProcessorFast
277bd6f feat(quantizer): add --vision-format bf16 flag and BF16 QuantType
c78a864 fix(vision): use group_size from file for HFQ4 dequant, convert F32->F16 for gemm_f16
6431f73 fix(vision): add HFQ4 dequant support to load_f32_gpu for norm/bias weights
e02d5a8 feat: add --vision-quant hfq4 flag for vision weight quantization
11a3740 perf(vision): revert tiled attention — L2 cache already handles tiling
fd363d1 perf(vision): remove per-layer device_synchronize barriers

Issue #23 — green/blue swap (the headline fix)

Empirical probing with pure-color PNGs, temp=0 greedy decoding, qwen3.5-9b-vl-hfq4.mq4:

| input PNG | RGB-order (pre-fix) | R↔B swap | B↔G swap (this PR) |
|-----------|---------------------|----------|--------------------|
| red       | "Red" ✓             | "Green"  | "Red" ✓            |
| green     | "Blue" ✗            | "Blue"   | "Green" ✓          |
| blue      | "Green" ✗           | "Red"    | "Blue" ✓           |

The two wrong mappings are a classic G↔B transposition. Root cause is most likely a channel permutation in the HuggingFace patch_embed weight export (input conv channels 1 and 2 appear transposed) — I did not chase that all the way into the export pipeline, but the preprocessing swap here resolves the end-to-end symptom and matches the pure-color ground truth cleanly.
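The swap itself is a one-line change to how interleaved RGB bytes are laid out into the planar CHW tensor. A minimal sketch (function name and normalization are illustrative, not the actual load_and_preprocess code):

```rust
/// Convert interleaved RGB8 pixels to a planar CHW f32 tensor,
/// routing the source B byte to channel 1 and the G byte to
/// channel 2 so the transposed patch_embed channels cancel out.
fn rgb_to_chw_swapped(pixels: &[u8], w: usize, h: usize) -> Vec<f32> {
    let n = w * h;
    let mut chw = vec![0f32; 3 * n];
    for i in 0..n {
        let (r, g, b) = (pixels[3 * i], pixels[3 * i + 1], pixels[3 * i + 2]);
        chw[i] = r as f32 / 255.0;         // channel 0: R, unchanged
        chw[n + i] = b as f32 / 255.0;     // channel 1: B (was G)
        chw[2 * n + i] = g as f32 / 255.0; // channel 2: G (was B)
    }
    chw
}
```

This is exactly the contract the regression tests below pin: a pure-blue pixel must land in channel 1, a pure-green pixel in channel 2.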

End-to-end verification

On the rebased branch with the fix applied:

$ ./target/release/examples/infer \
    /home/beans/.hipfire/models/qwen3.5-9b-vl-hfq4.mq4 \
    --image /home/beans/Pictures/pngtree-green-leaf-for-element-png-image_11934356.png \
    --no-think --max-tokens 80 \
    "What color is this leaf? Answer in one short sentence."

vision weight format: HFQ4-G256 (dequanting to F16 on load)
Prompt: 144 tokens (121 visual + 23 text)
Prefill: 1253ms (115 tok/s)
<think>

</think>

 Green.
=== Done: 7 tokens in 48ms (145.8 tok/s) ===

Pre-fix, the same image on the same model produced "Blue."

Regression test

crates/engine/tests/channel_order.rs pins the contract with pure-color + mixed-pixel fixtures. Runs without a GPU — part of the default cargo test path. If someone removes the swap, these assertions fail before any vision inference runs.

running 4 tests
test pure_blue_routes_b_byte_to_channel_1 ... ok
test mixed_pixel_round_trips_all_three_bytes ... ok
test pure_red_preserves_red_channel ... ok
test pure_green_routes_g_byte_to_channel_2 ... ok
test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

PR #22 review follow-ups

The review blockers on #22 that I believe are addressed here:

  • Dead kernel files: gemm_f16_wmma_tiled.hip and vit_attention_flash.hip removed. Neither was referenced via ensure_kernel dispatch; both were stale experimental variants (commit 6fcdddc).
  • Shape encoding bug at qwen35_vl.rs:118: load_f16_gpu was passing data.len() (bytes) as the tensor shape on both the F16-direct and HFQ4-dequant paths. Corrected to pass the element count n. The F16-direct path at line 110 had the same bug and is included in the fix (commit 872c5ba).
  • HFQ4 vision load-path visibility — the HFQ4 auto-dequant-to-F16 transition now logs at load time (vision weight format: HFQ4-G256 (dequanting to F16 on load)) so HFQ4 and F16 models can be told apart during debugging (commit 4e637e8).
  • Rebased onto upstream/master — one manual conflict in crates/rdna-compute/src/dispatch.rs where this branch's vit_attention_opt landed at the same spot as upstream's new attention_dflash_f32. Resolved by keeping both as siblings.

Outstanding from #22 review (not in this PR):

  • ⚠️ gemm_f16_wmma correctness claim — the original #22 description ("Vision encoder: HFQ4 quantization, sync optimization, 1.5x speedup") mentioned a wmma correctness fix. I couldn't reproduce a failing test against the current gemm_f16_wmma path on this branch and don't have a concrete reproducer to hand off. I'd rather retract the claim than ship it unexplained; happy to investigate further if you can point me at the failure mode you had in mind.

Gates

  • cargo test -p engine --test channel_order — 4/4 pass.
  • cargo build --release --features deltanet -p engine --example daemon --example infer — clean (pre-existing warnings only).
  • ./scripts/coherence-gate.sh on qwen3.5-9b.mq4 — text-only reasoning path OK post-rebase (/tmp/coherence-20260420-090512.md). 0.8b and 4b skipped (not present in my models dir).
  • cargo clippy -p engine --tests --features deltanet --no-deps — no new errors or warnings attributable to this PR.
  • ./scripts/speed-gate.sh --fast — skipped (4B model not present locally). Expect CI / a tester with qwen3.5-4b.mq4 available to confirm no regression.

Tested on gfx1100 / RX 7900 XTX / ROCm 7.2.

Test plan

  • Reviewer: cherry-pick e9c50c1 locally, re-run the pure-color matrix on qwen3.5-9b-vl-hfq4.mq4 (or whichever VL model reproduced #23 for you) — confirm red/green/blue all identified correctly.
  • Reviewer with a natural photo: run a real-world image of a known-colored object through infer --image and confirm the color word is now correct.
  • Run cargo test -p engine --test channel_order to confirm the preprocessing contract passes on your box.
  • Optional: point the follow-up investigation at the HF patch_embed weight export to confirm (or rule out) the channel-permutation root cause, so the fix can eventually move out of preprocessing and into the weight converter.

Closes #23.

beanssec added 11 commits April 20, 2026 08:54
…f16_tiled/gemm_f16_bias dispatch stubs

- Removed device_synchronize() every 9 layers (3 syncs → 1 at end)
- Added gemm_f16_tiled kernel dispatch (ILP-unrolled, no LDS, correct but no faster than naive)
- Added gemm_f16_bias fused kernel dispatch (GEMM+transpose+bias in 1 launch, correct but
  fewer blocks reduces GPU parallelism — slightly slower on gfx1100)
- WMMA kernel (gemm_f16_wmma) has correctness bug — garbage output under some conditions
- Color accuracy (green→purple) not yet resolved — preprocessing confirmed correct,
  likely BF16→F16 weight conversion or vision encoder implementation issue

Ref: Kaden-Schutt#21
…g, software tiling adds overhead

Benchmarked vit_attention_opt with K_TILE=64, QPB=2:
- Naive: 6.9s (data fits in L2 cache, no sync overhead)
- Tiled: 8.2s (shared memory loads + syncs add ~20% overhead)

Conclusion: both gemm_f16 and vit_attention_f32 are already well-optimized
for this workload (784 patches, K=1152-4304). The data is small enough for
L2 cache to act as implicit tiling. Software tiling with shared memory adds
synchronization overhead that outweighs the benefit.

The real path to matching text model performance is quantizing vision weights
to HFQ4 format, enabling the existing optimized gemm_hfq4g256_wmma kernels.

Ref: Kaden-Schutt#21
Quantizer changes:
- Parse --vision-quant <format> flag (supports: hfq4)
- When --include-vision --vision-quant hfq4 is set, vision weights are
  quantized to HFQ4G256/HFQ4G128 instead of stored as F16
- ~4x smaller vision weights (228MB vs 912MB)

Engine changes:
- load_f16_gpu now handles qt=6 (HFQ4G256) and qt=7 (HFQ4G128)
  by dequantizing to F32 on CPU before GPU upload
- Added dequant_hfq4g256 and dequant_hfq4g128 helper functions
- Vision encoder still uses F32 weights (dequantized from HFQ4)
  and naive gemm_f16 kernel — no GEMM changes needed
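For orientation, the CPU dequant step amounts to unpacking two 4-bit codes per byte and scaling each group. The exact HFQ4G256/G128 block layout isn't reproduced here; this is a hypothetical symmetric scheme (one f32 scale per group, codes centered at 8) purely to illustrate the shape of the work:

```rust
/// Hypothetical symmetric 4-bit group dequant. The real HFQ4 format
/// may differ (scale width, zero points, nibble packing order).
fn dequant_4bit_groups(packed: &[u8], scales: &[f32], group_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for (i, &byte) in packed.iter().enumerate() {
        // each byte holds two consecutive 4-bit codes
        let scale = scales[(i * 2) / group_size];
        out.push(((byte & 0x0F) as f32 - 8.0) * scale);
        out.push(((byte >> 4) as f32 - 8.0) * scale);
    }
    out
}
```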

Usage:
  hipfire-quantize --input ~/models/Qwen3.5-9B --output qwen3.5-9b-vl-hfq4.mq4 \
    --format mq4g256 --include-vision --vision-quant hfq4

Note: This dequantizes on CPU during load. Future optimization:
use HFQ4 GPU dequant kernels for on-the-fly dequantization in
the vision GEMM path (same as text model's gemm_hfq4g256_wmma).

Ref: Kaden-Schutt#21
…eights

When --vision-quant hfq4 is used, ALL vision tensors (including norms
and biases) are stored as HFQ4G128. load_f32_gpu previously asserted
qt=1 or qt=2, causing a panic on qt=7 (HFQ4G128).

Now handles qt=6 (HFQ4G256) and qt=7 (HFQ4G128) via CPU dequantization.

KNOWN ISSUE: HFQ4 vision weights produce garbage output (! chars).
The dequantization math is verified correct (unit tests pass).
Issue is likely in the HFQ file's data layout for vision tensors —
the cumulative data_offset calculation may be wrong for HFQ4 tensors
when mixed with MQ4 text tensors in the same file.

Ref: Kaden-Schutt#21
…F16 for gemm_f16

ROOT CAUSE of garbage output was twofold:
1. Dequant used qt-based block sizes but file had different gs than expected
2. Dequant uploaded F32 data but gemm_f16 kernel reads F16 bytes

Fix: use info.group_size for block size, convert dequant F32->F16 before upload.

BENCHMARK (qwen3.5-9b-vl, 448px image, 256 max_tokens):
  F16 vision:  7.5s warm, 10.4 GB file
  HFQ4 vision: 5.0s warm,  5.5 GB file
  Speedup: 1.5x faster, 47% smaller file

Usage:
  hipfire-quantize --input ~/models/Qwen3.5-9B \
    --output qwen3.5-9b-vl-hfq4.mq4 \
    --format mq4g256 --include-vision --vision-quant hfq4

Closes: Kaden-Schutt#21
Adds BF16 storage format (qt=16) for vision weights. When --vision-format
bf16 is used with --include-vision, BF16 source weights are stored as-is
(no BF16->F16 conversion). Non-BF16 sources fall back to F16.

Also adds:
- bf16_to_f32() conversion function in vision loader
- load_f16_gpu now outputs F32 for all formats (F16->F32, BF16->F32, HFQ4->F32)
- linear_f16 uses gemm_f32_batched instead of gemm_f16 (F32 compute path)
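Since bfloat16 is just the top 16 bits of an IEEE-754 f32, the widening conversion is a single shift. A sketch (the loader's actual helper may differ in signature):

```rust
/// Widen a bfloat16 bit pattern to f32 by placing it in the upper
/// half of the 32-bit word; the low 16 mantissa bits become zero,
/// so the conversion is exact for every bf16 value.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}
```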

STATUS: Color accuracy not yet fixed. Intermediate tensor comparison shows
values match HuggingFace through pos_embed but diverge at layer 0 despite
matching weights. Root cause unknown — needs HuggingFace reference comparison
or community research for known Qwen3.5 vision issues.

Usage:
  hipfire-quantize --input ~/models/Qwen3.5-9B \
    --output qwen3.5-9b-vl-bf16.mq4 \
    --format mq4g256 --include-vision --vision-format bf16

Ref: Kaden-Schutt#23
…ProcessorFast

Replaces hardcoded resize_exact(448, 448) with smart_resize using
factor=28 and min/max pixel bounds. For a 360x360 image:

Before: 448x448 (784 patches, 28x28 grid) — WRONG
After:  352x352 (484 patches, 22x22 grid) — matches HuggingFace

The factor=28 ensures the image grid is always a multiple of
spatial_merge_size (2), preventing the panic from uneven patch counts.
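For reference, HuggingFace's smart_resize first rounds each side to the nearest multiple of factor, then rescales both sides by the square root of the area ratio if the rounded pixel count falls outside the min/max bounds. A Rust sketch of that logic (parameter names are mine, not the crate's):

```rust
/// Sketch of HF-style smart_resize: round each side to a multiple of
/// `factor`, then shrink (floor) or grow (ceil) both sides uniformly
/// if the rounded area lands outside [min_px, max_px].
fn smart_resize(h: u32, w: u32, factor: u32, min_px: u32, max_px: u32) -> (u32, u32) {
    let f = factor as f64;
    let round = |d: f64| ((d / f).round().max(1.0) as u32) * factor;
    let (mut hb, mut wb) = (round(h as f64), round(w as f64));
    if hb * wb > max_px {
        // too many pixels: scale down by sqrt(area ratio), floor to factor
        let beta = ((h as f64 * w as f64) / max_px as f64).sqrt();
        hb = ((h as f64 / beta / f).floor().max(1.0) as u32) * factor;
        wb = ((w as f64 / beta / f).floor().max(1.0) as u32) * factor;
    } else if hb * wb < min_px {
        // too few pixels: scale up by sqrt(area ratio), ceil to factor
        let beta = (min_px as f64 / (h as f64 * w as f64)).sqrt();
        hb = ((h as f64 * beta / f).ceil() as u32) * factor;
        wb = ((w as f64 * beta / f).ceil() as u32) * factor;
    }
    (hb, wb)
}
```

Both outputs are always multiples of factor, which is what guarantees the patch grid divides evenly by spatial_merge_size.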

Color accuracy: still purple. The resolution fix was necessary but not
sufficient. Root cause of color shift remains unknown — needs
HuggingFace reference comparison at layer 0 to identify the divergence.

Ref: Kaden-Schutt#23
…lash

Neither kernel was referenced via ensure_kernel dispatch; both were
stale experimental variants superseded by the active vit_attention
and gemm_f16 / gemm_f16_tiled paths. Carries no runtime impact.

Addresses PR Kaden-Schutt#22 review comment re: dead kernel files.
load_f16_gpu was passing byte-length (data.len() / f16_bytes.len())
as the shape array to gpu.upload_raw, so downstream shape-aware
dispatch saw a tensor shaped [byte_count] instead of
[element_count]. Corrected both the F16-direct and HFQ4-dequant
paths to pass the computed element count n.

Addresses PR Kaden-Schutt#22 review feedback on qwen35_vl.rs:118.
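The distinction in one line: for a raw F16 byte buffer, the shape must use the element count, which is half the byte count (illustrative helper; upload_raw's real signature isn't shown here):

```rust
/// Element count of an F16 tensor serialized as raw bytes: divide the
/// byte length by the 2-byte element size instead of passing
/// data.len() directly as the shape.
fn f16_element_count(raw: &[u8]) -> usize {
    raw.len() / std::mem::size_of::<u16>()
}
```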
The HFQ4 vision path (qt=6 G256, qt=7 G128) auto-dequantizes weights
to F16 at load time because there's no GPU HFQ4 kernel for vision
yet. Without visibility into this, an HFQ4 model looks identical to
an F16 model at load, making it hard to attribute perf or memory
differences during debugging. Now prints a single line at load:

    vision weight format: HFQ4-G256 (dequanting to F16 on load)

Addresses PR Kaden-Schutt#22 review request for better weight-format
observability.
Kaden-Schutt#23)

Empirical finding from pure-color PNG probing with temp=0 greedy decoding
on qwen3.5-9b-vl-hfq4.mq4:

    input  | RGB-order (pre-fix) | R<->B swap | B<->G swap (this fix)
    -------+---------------------+------------+---------------------
    red    | "Red"   OK          | "Green"    | "Red"   OK
    green  | "Blue"  WRONG       | "Blue"     | "Green" OK
    blue   | "Green" WRONG       | "Red"      | "Blue"  OK

The two wrong mappings (green->blue, blue->green) are a classic G<->B
transposition. Root cause most likely lives in the HuggingFace
patch_embed weight export (input conv channels 1 and 2 appear
transposed); the repair here lives in preprocessing. load_and_preprocess
now writes pixel bytes in R,B,G order into the CHW tensor so the two
transpositions cancel.

End-to-end verification with the same 9B-VL HFQ4 model:

    Image: pngtree-green-leaf PNG (mean RGB ~[36, 55, 28])
    Prompt: "What color is this leaf? Answer in one short sentence."
    Output: "Green."

Adds crates/engine/tests/channel_order.rs pinning the contract with
four pure-color + one mixed-pixel cases. The tests do not require a
GPU and run in the default `cargo test` path.

Fixes Kaden-Schutt#23.