fix(vision): resolve #23 color misidentification + address PR #22 review #35
Open
nickfinease wants to merge 11 commits into Kaden-Schutt:master from
Conversation
…f16_tiled/gemm_f16_bias dispatch stubs

- Removed device_synchronize() every 9 layers (3 syncs → 1 at end)
- Added gemm_f16_tiled kernel dispatch (ILP-unrolled, no LDS; correct but no faster than naive)
- Added gemm_f16_bias fused kernel dispatch (GEMM + transpose + bias in one launch; correct, but fewer blocks reduce GPU parallelism, slightly slower on gfx1100)
- WMMA kernel (gemm_f16_wmma) has a correctness bug: garbage output under some conditions
- Color accuracy (green → purple) not yet resolved; preprocessing confirmed correct, likely a BF16→F16 weight-conversion or vision-encoder implementation issue

Ref: Kaden-Schutt#21
…g, software tiling adds overhead

Benchmarked vit_attention_opt with K_TILE=64, QPB=2:

- Naive: 6.9 s (data fits in L2 cache, no sync overhead)
- Tiled: 8.2 s (shared-memory loads + syncs add ~20% overhead)

Conclusion: both gemm_f16 and vit_attention_f32 are already well optimized for this workload (784 patches, K=1152-4304). The data is small enough for L2 cache to act as implicit tiling; software tiling with shared memory adds synchronization overhead that outweighs the benefit. The real path to matching text-model performance is quantizing vision weights to HFQ4 format, enabling the existing optimized gemm_hfq4g256_wmma kernels.

Ref: Kaden-Schutt#21
Quantizer changes:
- Parse --vision-quant <format> flag (supports: hfq4)
- When --include-vision --vision-quant hfq4 is set, vision weights are
quantized to HFQ4G256/HFQ4G128 instead of stored as F16
- ~4x smaller vision weights (228 MB vs 912 MB)
Engine changes:
- load_f16_gpu now handles qt=6 (HFQ4G256) and qt=7 (HFQ4G128)
by dequantizing to F32 on CPU before GPU upload
- Added dequant_hfq4g256 and dequant_hfq4g128 helper functions
- Vision encoder still uses F32 weights (dequantized from HFQ4)
and naive gemm_f16 kernel — no GEMM changes needed
Usage:
hipfire-quantize --input ~/models/Qwen3.5-9B --output qwen3.5-9b-vl-hfq4.mq4 \
--format mq4g256 --include-vision --vision-quant hfq4
Note: This dequantizes on CPU during load. Future optimization:
use HFQ4 GPU dequant kernels for on-the-fly dequantization in
the vision GEMM path (same as text model's gemm_hfq4g256_wmma).
Ref: Kaden-Schutt#21
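The dequant helpers above follow the usual 4-bit group-quantization pattern: each group shares one scale, and each byte packs two nibbles. This is a hedged sketch only; the real HFQ4G256/HFQ4G128 layout (scale dtype, nibble order, zero-point) is an assumption, and `dequant_4bit_group` is a hypothetical name, not the engine's actual function.

```rust
// Hypothetical 4-bit group dequantization in the spirit of
// dequant_hfq4g256 / dequant_hfq4g128. Assumes unsigned nibbles 0..15
// with a -8 offset and one f32 scale per group; the real HFQ4 layout
// may differ.
fn dequant_4bit_group(packed: &[u8], scale: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        // low nibble first, then high nibble
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out.push(lo as f32 * scale);
        out.push(hi as f32 * scale);
    }
    out
}

fn main() {
    // one packed byte 0x9A: low nibble 0xA -> 2, high nibble 0x9 -> 1
    let vals = dequant_4bit_group(&[0x9A], 2.0);
    assert_eq!(vals, vec![4.0, 2.0]);
    println!("dequant sketch OK");
}
```

A full HFQ4G256 decoder would iterate this per 256-value group, reading each group's scale from the tensor's data stream.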
…eights

When --vision-quant hfq4 is used, ALL vision tensors (including norms and biases) are stored as HFQ4G128. load_f32_gpu previously asserted qt=1 or qt=2, causing a panic on qt=7 (HFQ4G128). It now handles qt=6 (HFQ4G256) and qt=7 (HFQ4G128) via CPU dequantization.

KNOWN ISSUE: HFQ4 vision weights produce garbage output (! chars). The dequantization math is verified correct (unit tests pass). The issue is likely in the HFQ file's data layout for vision tensors: the cumulative data_offset calculation may be wrong for HFQ4 tensors when mixed with MQ4 text tensors in the same file.

Ref: Kaden-Schutt#21
…F16 for gemm_f16
ROOT CAUSE of garbage output was twofold:
1. Dequant used qt-based block sizes, but the file's group size (gs) differed from the expected value
2. Dequant uploaded F32 data, but the gemm_f16 kernel reads F16 bytes
Fix: use info.group_size for the block size, and convert the dequantized F32 to F16 before upload.
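The F32→F16 conversion before upload can be sketched as a plain bit-level narrowing. This version truncates the mantissa instead of rounding to nearest even, which is an assumption; the engine's actual code may use the `half` crate or hardware conversion.

```rust
// Truncating F32 -> F16 bit conversion, sketching the "convert dequant
// F32->F16 before upload" step. Truncates the mantissa (no RNE rounding).
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp32 = ((bits >> 23) & 0xFF) as i32;
    let mant = bits & 0x007F_FFFF;

    if exp32 == 0xFF {
        // Inf / NaN: keep a quiet-NaN bit if any mantissa bit was set
        let nan_bit: u16 = if mant != 0 { 0x0200 } else { 0 };
        return sign | 0x7C00 | nan_bit;
    }
    let exp = exp32 - 127 + 15; // rebias to the 5-bit F16 exponent
    if exp >= 0x1F {
        return sign | 0x7C00; // overflow -> infinity
    }
    if exp <= 0 {
        if exp < -10 {
            return sign; // underflow -> signed zero
        }
        // subnormal: restore the implicit leading 1, shift into range
        let m = (mant | 0x0080_0000) >> (1 - exp);
        return sign | (m >> 13) as u16;
    }
    sign | ((exp as u16) << 10) | (mant >> 13) as u16
}

fn main() {
    assert_eq!(f32_to_f16_bits(1.0), 0x3C00);
    assert_eq!(f32_to_f16_bits(-2.0), 0xC000);
    assert_eq!(f32_to_f16_bits(0.0), 0x0000);
    println!("f16 conversion sketch OK");
}
```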
BENCHMARK (qwen3.5-9b-vl, 448px image, 256 max_tokens):
F16 vision: 7.5s warm, 10.4 GB file
HFQ4 vision: 5.0s warm, 5.5 GB file
Speedup: 1.5x faster, 47% smaller file
Usage:
hipfire-quantize --input ~/models/Qwen3.5-9B \
--output qwen3.5-9b-vl-hfq4.mq4 \
--format mq4g256 --include-vision --vision-quant hfq4
Closes: Kaden-Schutt#21
Adds BF16 storage format (qt=16) for vision weights. When --vision-format
bf16 is used with --include-vision, BF16 source weights are stored as-is
(no BF16->F16 conversion). Non-BF16 sources fall back to F16.
Also adds:
- bf16_to_f32() conversion function in vision loader
- load_f16_gpu now outputs F32 for all formats (F16->F32, BF16->F32, HFQ4->F32)
- linear_f16 uses gemm_f32_batched instead of gemm_f16 (F32 compute path)
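The BF16→F32 widening mentioned above is just a 16-bit left shift of the raw bits, since BF16 is the top half of an IEEE-754 F32. A minimal sketch (the signature of the actual bf16_to_f32 in the vision loader may differ):

```rust
// BF16 keeps the F32 sign, 8-bit exponent, and top 7 mantissa bits,
// so widening to F32 is lossless: shift the 16 bits into the high half.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

fn main() {
    assert_eq!(bf16_to_f32(0x3F80), 1.0);  // 0x3F800000 == 1.0f32
    assert_eq!(bf16_to_f32(0xC000), -2.0); // 0xC0000000 == -2.0f32
    assert_eq!(bf16_to_f32(0x0000), 0.0);
    println!("bf16 sketch OK");
}
```

This is also why storing BF16 as-is (no BF16→F16 conversion) avoids precision loss: F16 has only 5 exponent bits, so a BF16→F16 round trip can overflow or flush values that BF16→F32 preserves exactly.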
STATUS: Color accuracy not yet fixed. Intermediate tensor comparison shows
values match HuggingFace through pos_embed but diverge at layer 0 despite
matching weights. Root cause unknown — needs HuggingFace reference comparison
or community research for known Qwen3.5 vision issues.
Usage:
hipfire-quantize --input ~/models/Qwen3.5-9B \
--output qwen3.5-9b-vl-bf16.mq4 \
--format mq4g256 --include-vision --vision-format bf16
Ref: Kaden-Schutt#23
…ProcessorFast

Replaces hardcoded resize_exact(448, 448) with smart_resize using factor=28 and min/max pixel bounds. For a 360x360 image:

- Before: 448x448 (784 patches, 28x28 grid): WRONG
- After: 352x352 (484 patches, 22x22 grid): matches HuggingFace

The factor=28 ensures the image grid is always a multiple of spatial_merge_size (2), preventing the panic from uneven patch counts.

Color accuracy: still purple. The resolution fix was necessary but not sufficient. The root cause of the color shift remains unknown; it needs a HuggingFace reference comparison at layer 0 to identify the divergence.

Ref: Kaden-Schutt#23
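The resize rule above can be sketched as "snap to the factor grid, then clamp to the pixel budget". This is a hedged sketch modeled on the HuggingFace-style smart_resize; the function names and the exact bounds are assumptions, not the commit's code.

```rust
// Sketch of a smart_resize-style rule: round each side to a multiple of
// `factor`, then rescale if the area falls outside [min_pixels, max_pixels].
fn round_to_factor(x: u32, factor: u32) -> u32 {
    ((x as f64 / factor as f64).round() as u32).max(1) * factor
}

fn smart_resize(h: u32, w: u32, factor: u32, min_pixels: u32, max_pixels: u32) -> (u32, u32) {
    let mut hb = round_to_factor(h, factor);
    let mut wb = round_to_factor(w, factor);
    if hb * wb > max_pixels {
        // shrink while roughly preserving aspect ratio, flooring onto the grid
        let beta = ((h as f64 * w as f64) / max_pixels as f64).sqrt();
        hb = ((h as f64 / beta / factor as f64).floor() as u32).max(1) * factor;
        wb = ((w as f64 / beta / factor as f64).floor() as u32).max(1) * factor;
    } else if hb * wb < min_pixels {
        // grow the same way, ceiling onto the grid
        let beta = (min_pixels as f64 / (h as f64 * w as f64)).sqrt();
        hb = ((h as f64 * beta / factor as f64).ceil() as u32) * factor;
        wb = ((w as f64 * beta / factor as f64).ceil() as u32) * factor;
    }
    (hb, wb)
}

fn main() {
    // 100x100 snaps up to the nearest multiple of 28
    assert_eq!(smart_resize(100, 100, 28, 784, 1_000_000), (112, 112));
    println!("smart_resize sketch OK");
}
```

Because every side is a multiple of `factor` (patch size × spatial_merge_size), the resulting patch grid divides evenly by the merge size, which is what prevents the uneven-patch-count panic.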
…lash

Neither kernel was referenced via ensure_kernel dispatch; both were stale experimental variants superseded by the active vit_attention and gemm_f16 / gemm_f16_tiled paths. No runtime impact.

Addresses PR Kaden-Schutt#22 review comment re: dead kernel files.
load_f16_gpu was passing the byte length (data.len() / f16_bytes.len()) as the shape array to gpu.upload_raw, so downstream shape-aware dispatch saw a tensor shaped [byte_count] instead of [element_count]. Corrected both the F16-direct and HFQ4-dequant paths to pass the computed element count n.

Addresses PR Kaden-Schutt#22 review feedback on qwen35_vl.rs:118.
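The fix in one line: the shape passed to the uploader must count elements, not bytes. A hypothetical helper illustrating the corrected computation (the name is mine, not the engine's):

```rust
// F16 values are two bytes each, so the element count is half the byte
// length. Passing raw.len() directly would produce a [byte_count] shape.
fn f16_element_count(raw: &[u8]) -> usize {
    assert!(raw.len() % 2 == 0, "F16 buffer must have an even byte length");
    raw.len() / std::mem::size_of::<u16>()
}

fn main() {
    let raw = vec![0u8; 8]; // four F16 values
    assert_eq!(f16_element_count(&raw), 4);
    println!("element count OK");
}
```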
The HFQ4 vision path (qt=6 G256, qt=7 G128) auto-dequantizes weights
to F16 at load time because there's no GPU HFQ4 kernel for vision
yet. Without visibility into this, an HFQ4 model looks identical to
an F16 model at load, making it hard to attribute perf or memory
differences during debugging. Now prints a single line at load:
vision weight format: HFQ4-G256 (dequanting to F16 on load)
Addresses PR Kaden-Schutt#22 review request for better weight-format
observability.
Kaden-Schutt#23)

Empirical finding from pure-color PNG probing with temp=0 greedy decoding on qwen3.5-9b-vl-hfq4.mq4:

input  | RGB-order (pre-fix) | R<->B swap | B<->G swap (this fix)
-------+---------------------+------------+----------------------
red    | "Red" OK            | "Green"    | "Red" OK
green  | "Blue" WRONG        | "Blue"     | "Green" OK
blue   | "Green" WRONG       | "Red"      | "Blue" OK

The two wrong mappings (green->blue, blue->green) are a classic G<->B transposition. The root cause most likely lives in the HuggingFace patch_embed weight export (input conv channels 1 and 2 appear transposed); the repair here lives in preprocessing. load_and_preprocess now writes pixel bytes in R,B,G order into the CHW tensor so the two transpositions cancel.

End-to-end verification with the same 9B-VL HFQ4 model:

Image: pngtree-green-leaf PNG (mean RGB ~[36, 55, 28])
Prompt: "What color is this leaf? Answer in one short sentence."
Output: "Green."

Adds crates/engine/tests/channel_order.rs pinning the contract with four pure-color + one mixed-pixel cases. The tests do not require a GPU and run in the default `cargo test` path.

Fixes Kaden-Schutt#23.
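The swapped CHW write described above can be sketched as a standalone function (this free function is hypothetical; the commit's load_and_preprocess does this inline):

```rust
// Write interleaved RGB pixel bytes into a planar CHW tensor in R,B,G
// channel order, so an upstream G<->B weight transposition cancels out.
fn rgb_to_chw_rbg(rgb: &[u8], w: usize, h: usize) -> Vec<f32> {
    let n = w * h;
    assert_eq!(rgb.len(), 3 * n);
    let mut out = vec![0.0f32; 3 * n];
    for i in 0..n {
        let (r, g, b) = (rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
        out[i] = r as f32 / 255.0;         // plane 0: R (unchanged)
        out[n + i] = b as f32 / 255.0;     // plane 1: B (was G)
        out[2 * n + i] = g as f32 / 255.0; // plane 2: G (was B)
    }
    out
}

fn main() {
    // one pure-green pixel lands in plane 2, matching the fix's contract
    let t = rgb_to_chw_rbg(&[0, 255, 0], 1, 1);
    assert_eq!(t, vec![0.0, 0.0, 1.0]);
    println!("channel order sketch OK");
}
```

A channel_order-style regression test reduces to exactly these assertions: feed a pure-color pixel and pin which plane receives it.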
Summary
Supersedes #22. Rebased onto current `upstream/master` and addresses the review feedback there, plus lands the root-cause fix for issue #23 (VL model misidentifies green/blue objects).

Branch history (11 commits on top of `upstream/master`):

Issue #23: green/blue swap (the headline fix)
Empirical probing with pure-color PNGs, `temp=0` greedy decoding, `qwen3.5-9b-vl-hfq4.mq4`: the two wrong mappings are a classic G↔B transposition. Root cause is most likely a channel permutation in the HuggingFace `patch_embed` weight export (input conv channels 1 and 2 appear transposed). I did not chase that all the way into the export pipeline, but the preprocessing swap here resolves the end-to-end symptom and matches the pure-color ground truth cleanly.

End-to-end verification
On the rebased branch with the fix applied:
Pre-fix, the same image on the same model produced "Blue."
Regression test
`crates/engine/tests/channel_order.rs` pins the contract with pure-color + mixed-pixel fixtures. Runs without a GPU; part of the default `cargo test` path. If someone removes the swap, these assertions fail before any vision inference runs.

PR #22 review follow-ups
The review blockers on #22 that I believe are addressed here:
- `gemm_f16_wmma_tiled.hip` and `vit_attention_flash.hip` removed. Neither was referenced via `ensure_kernel` dispatch; both were stale experimental variants (commit 6fcdddc).
- `qwen35_vl.rs:118`: `load_f16_gpu` was passing `data.len()` (bytes) as the tensor shape on both the F16-direct and HFQ4-dequant paths. Corrected to pass the element count `n`. The F16-direct path at line 110 had the same bug and is included in the fix (commit 872c5ba).
- Load-time log line (`vision weight format: HFQ4-G256 (dequanting to F16 on load)`) so HFQ4 and F16 models can be told apart during debugging (commit 4e637e8).
- Rebased onto `upstream/master`; one manual conflict in `crates/rdna-compute/src/dispatch.rs` where this branch's `vit_attention_opt` landed at the same spot as upstream's new `attention_dflash_f32`. Resolved by keeping both as siblings.

Outstanding from #22 review (not in this PR):
- `gemm_f16_wmma` correctness claim: the original #22 description ("Vision encoder: HFQ4 quantization, sync optimization, 1.5x speedup") mentioned a wmma correctness fix. I couldn't reproduce a failing test against the current `gemm_f16_wmma` path on this branch and don't have a concrete reproducer to hand off. I'd rather retract the claim than ship it unexplained; happy to investigate further if you can point me at the failure mode you had in mind.

Gates
- `cargo test -p engine --test channel_order`: 4/4 pass.
- `cargo build --release --features deltanet -p engine --example daemon --example infer`: clean (pre-existing warnings only).
- `./scripts/coherence-gate.sh` on `qwen3.5-9b.mq4`: text-only reasoning path OK post-rebase (`/tmp/coherence-20260420-090512.md`). `0.8b` and `4b` skipped (not present in my models dir).
- `cargo clippy -p engine --tests --features deltanet --no-deps`: no new errors or warnings attributable to this PR.
- `./scripts/speed-gate.sh --fast`: skipped (4B model not present locally). Expect CI / a tester with `qwen3.5-4b.mq4` available to confirm no regression.

Tested on
gfx1100 / RX 7900 XTX / ROCm 7.2.

Test plan
- Check out `e9c50c1` locally and re-run the pure-color matrix on `qwen3.5-9b-vl-hfq4.mq4` (or whichever VL model reproduced #23 for you); confirm red/green/blue are all identified correctly.
- Run `infer --image` and confirm the color word is now correct.
- Run `cargo test -p engine --test channel_order` to confirm the preprocessing contract passes on your box.
- Optionally, inspect the HuggingFace `patch_embed` weight export to confirm (or rule out) the channel-permutation root cause, so the fix can eventually move out of preprocessing and into the weight converter.

Closes #23.