
Vision encoder color accuracy: F32 computation path to match HuggingFace BF16 precision #23

@nickfinease

Description


Problem

The vision encoder produces inaccurate color descriptions (e.g. green leaves described as purple), while structural descriptions (shape, texture, layout) are correct.

Root cause (diagnosed)

HuggingFace runs the Qwen3.5 vision encoder in BF16 arithmetic; Hipfire runs it in F16. The different rounding behavior during the matrix-multiply and attention computations accumulates over 27 transformer layers, shifting the feature magnitudes that encode color information.

Diagnostic proof (intermediate tensor comparison against HuggingFace PyTorch reference):

  • Preprocessed tensor: Channel means match exactly — MATCH
  • patch_embed output: Diff < 0.0001 — MATCH
  • pos_embed output: Diff < 0.0001 — MATCH
  • After transformer layer 0: 10-30% relative error on feature values — DIVERGE

The weights themselves are correct. The divergence starts at layer 0, where the F16 computation path rounds differently than BF16.

Why structure is correct but color is wrong

  • Spatial structure is encoded in attention patterns (softmax-normalized, robust to small errors)
  • Color is encoded in feature magnitudes (sensitive to per-operation rounding differences)
  • The F16 vs BF16 rounding difference is tiny per operation (~0.001%) but compounds through 27 layers x millions of operations

Proposed solution: F32 computation path

  1. Store vision weights as BF16 in the HFQ file (original format, no conversion)
  2. On load, convert BF16 to F32 (lossless — BF16 is a subset of F32)
  3. Compute in F32 using existing gemm_f32_batched kernel

F32 has 23 mantissa bits vs BF16's 7, so the F32 compute path is strictly MORE precise than HuggingFace's BF16 path; results should be at least as accurate.

Tradeoffs

  • F16 (current): Color wrong, structure correct, fast GEMM, 0.9GB VRAM
  • F32 (proposed): Color correct, structure correct, ~2x slower GEMM, 1.8GB VRAM

Estimated total vision forward impact: ~2x slower than current F16 path. For a 7.5s inference, this would be ~15s. Still faster than not having VL at all.

Alternative: BF16 native kernel

A gemm_bf16 kernel that does BF16 matmul natively would match HuggingFace exactly AND maintain F16-level speed. But this requires:

  • New C type (__bf16) in the kernel
  • Different GPU arithmetic instructions
  • Changes to kernel compiler, dispatch layer, weight loader
  • Hardware BF16 support (available on gfx1100/RDNA3, NOT on gfx1030/RDNA2)

This is a larger effort but the ideal long-term solution.

Implementation scope

F32 path (simple, ~50 lines):

  • Add --vision-format bf16 to quantizer (already partially implemented)
  • Add BF16 to F32 conversion in load_f16_gpu
  • Change vision encoder to use gemm_f32_batched instead of gemm_f16

BF16 native kernel (complex, ~300 lines):

  • New gemm_bf16.hip kernel
  • New dispatch function
  • New weight loader path
  • Hardware capability detection

Recommendation

Implement the F32 path first as a --vision-compute f32 flag. Measure the actual speed impact. If acceptable, make it the default for VL models. If too slow, invest in the BF16 native kernel.

Ref: #22
Ref: #21
