
Vision encoder color accuracy: F32 computation path to match HuggingFace BF16 precision #23

@nickfinease

Description


Problem

The vision encoder produces inaccurate color descriptions (e.g. green leaves described as purple), while structural descriptions (shape, texture, layout) are correct.

Root cause (diagnosed)

HuggingFace runs the Qwen3.5 vision encoder in BF16 arithmetic; Hipfire runs it in F16. The different rounding behavior during the matrix-multiply and attention computations accumulates over 27 transformer layers, shifting the feature magnitudes that encode color information.

Diagnostic proof (intermediate tensor comparison against HuggingFace PyTorch reference):

  • Preprocessed tensor: Channel means match exactly — MATCH
  • patch_embed output: Diff < 0.0001 — MATCH
  • pos_embed output: Diff < 0.0001 — MATCH
  • After transformer layer 0: 10-30% relative error on feature values — DIVERGE

The weights themselves are correct. The divergence starts at layer 0, where the F16 computation path rounds differently than BF16.

Why structure is correct but color is wrong

  • Spatial structure is encoded in attention patterns (softmax-normalized, robust to small errors)
  • Color is encoded in feature magnitudes (sensitive to per-operation rounding differences)
  • The F16 vs BF16 rounding difference is tiny per operation (~0.001%) but compounds through 27 layers x millions of operations

Proposed solution: F32 computation path

  1. Store vision weights as BF16 in the HFQ file (original format, no conversion)
  2. On load, convert BF16 to F32 (lossless — BF16 is a subset of F32)
  3. Compute in F32 using existing gemm_f32_batched kernel

F32 has 23 mantissa bits vs BF16's 7, so the F32 compute path is strictly MORE precise than HuggingFace's BF16 path; results should be at least as accurate.

Tradeoffs

  • F16 (current): Color wrong, structure correct, fast GEMM, 0.9GB VRAM
  • F32 (proposed): Color correct, structure correct, ~2x slower GEMM, 1.8GB VRAM

Estimated total vision forward impact: ~2x slower than current F16 path. For a 7.5s inference, this would be ~15s. Still faster than not having VL at all.

Alternative: BF16 native kernel

A gemm_bf16 kernel that does BF16 matmul natively would match HuggingFace exactly AND maintain F16-level speed. But this requires:

  • New C type (__bf16) in the kernel
  • Different GPU arithmetic instructions
  • Changes to kernel compiler, dispatch layer, weight loader
  • Hardware BF16 support (available on gfx1100/RDNA3, NOT on gfx1030/RDNA2)

This is a larger effort but the ideal long-term solution.

Implementation scope

F32 path (simple, ~50 lines):

  • Add --vision-format bf16 to quantizer (already partially implemented)
  • Add BF16 to F32 conversion in load_f16_gpu
  • Change vision encoder to use gemm_f32_batched instead of gemm_f16

BF16 native kernel (complex, ~300 lines):

  • New gemm_bf16.hip kernel
  • New dispatch function
  • New weight loader path
  • Hardware capability detection

Recommendation

Implement the F32 path first as a --vision-compute f32 flag. Measure the actual speed impact. If acceptable, make it the default for VL models. If too slow, invest in the BF16 native kernel.

Ref: #22
Ref: #21
