Improve TorchAO quantization test coverage and XPU support by jiqing-feng · Pull Request #13530 · huggingface/diffusers

jiqing-feng · 2026-04-21T05:59:22Z

What does this PR do?

This PR improves the TorchAO quantization testing infrastructure with several fixes: enabling int4wo tests on Intel XPU, implementing _dequantize for TorchAO, fixing input dtype mismatches, and fixing training gradient underflow.

Changes

Enable int4wo tests on XPU: Removed the _int4wo_skip marker that restricted int4wo tests to CUDA only, allowing them to run on all accelerator backends.
XPU-specific int4 packing format: Added XPU-specific handling in _get_quant_config(), since Intel XPU requires int4_packing_format="plain_int32" for Int4WeightOnlyConfig.
Fix input dtype casting: Introduced a _get_dummy_inputs_for_model(model) helper in QuantizationTesterMixin that automatically casts floating-point input tensors to the model's parameter dtype, preventing dtype mismatches during quantized model inference.
Implement _dequantize for TorchAO: Added a _dequantize() method to TorchAoHfQuantizer that iterates over all nn.Linear modules, calls weight.dequantize() on TorchAOBaseTensor weights, and replaces them with standard nn.Parameter objects. Also fixed _verify_if_layer_quantized to check isinstance(module.weight, TorchAOBaseTensor), so that dequantized layers are correctly detected as non-quantized.
Fix training gradient underflow: Changed autocast dtype from float16 to bfloat16 in _test_quantization_training. Float16's limited dynamic range (max ~65504, min subnormal ~5.96e-8) causes gradients to underflow to zero when passing through quantized tensor subclass operations; bfloat16 shares float32's exponent range and avoids this issue.
Reduce WanAnimate TorchAO test input sizes: Shrunk the dummy inputs in TestWanAnimateTransformer3DTorchAo to avoid OOM on devices without FlashAttention (e.g. XPU, which falls back to math SDPA and materializes the full O(S²) attention matrix). Reduced hidden_states from (1,36,21,64,64) to (1,36,5,16,16) and face_pixel_values from (1,3,77,512,512) to (1,3,13,512,512), which brings the self-attention sequence length from 21,504 down to 320 and peak attention memory from ~74 GiB to ~16 MB. The face frame count (13) is chosen so that the face encoder's two stride-2 convolutions produce a temporal length of 4; adding 1 padding frame gives 5, matching the temporal dimension of hidden_states.
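For the XPU packing-format change, the device-specific branch can be sketched as a small helper that returns extra keyword arguments only for int4 weight-only on XPU. The helper name `int4wo_extra_kwargs` and the `quant_type`/`device_type` strings are hypothetical illustrations, not the actual `_get_quant_config()` code; only the `int4_packing_format="plain_int32"` value comes from this PR.

```python
def int4wo_extra_kwargs(quant_type: str, device_type: str) -> dict:
    """Extra config kwargs for a quantization type on a given device.

    Hypothetical helper: per this PR, only int4 weight-only ("int4wo") on
    Intel XPU needs a non-default packing format; every other case adds nothing.
    """
    if quant_type == "int4wo" and device_type == "xpu":
        return {"int4_packing_format": "plain_int32"}
    return {}


print(int4wo_extra_kwargs("int4wo", "xpu"))   # {'int4_packing_format': 'plain_int32'}
print(int4wo_extra_kwargs("int4wo", "cuda"))  # {}
print(int4wo_extra_kwargs("int8wo", "xpu"))   # {}
```

The returned dict would then be forwarded as keyword arguments to torchao's Int4WeightOnlyConfig. Keeping the check in one place means other quantization types and other backends keep their default configs, which matches the intent of the "only int4wo need specific format" commit.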
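The input dtype fix can be illustrated with a minimal sketch, assuming the model's first parameter dtype is representative of the whole model. `cast_float_inputs_to_model_dtype` is a hypothetical stand-in for `_get_dummy_inputs_for_model`, and the input names are only examples: floating-point tensors are cast, while integer tensors such as timesteps are left alone.

```python
import torch
from torch import nn


def cast_float_inputs_to_model_dtype(model: nn.Module, inputs: dict) -> dict:
    """Cast floating-point tensors to the dtype of the model's first parameter.

    Hypothetical stand-in for _get_dummy_inputs_for_model; non-float tensors
    (e.g. integer timesteps) and non-tensor values are returned unchanged.
    """
    dtype = next(model.parameters()).dtype
    return {
        name: value.to(dtype)
        if isinstance(value, torch.Tensor) and value.is_floating_point()
        else value
        for name, value in inputs.items()
    }


model = nn.Linear(8, 8).to(torch.bfloat16)
inputs = {"hidden_states": torch.randn(2, 8), "timestep": torch.tensor([10])}
cast = cast_float_inputs_to_model_dtype(model, inputs)
out = model(cast["hidden_states"])  # no float32-vs-bfloat16 mismatch in the matmul
```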
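The dequantize-and-replace pattern described for `_dequantize()` can be sketched without torchao by injecting the "is this quantized?" check and the conversion as callables. This is an assumption-laden illustration, not the TorchAoHfQuantizer code: in the PR the check is `isinstance(module.weight, TorchAOBaseTensor)` and the conversion is `weight.dequantize()`, while here a float16 weight merely plays the role of a "quantized" one so the example runs with plain PyTorch.

```python
import torch
from torch import nn


def dequantize_linear_layers(model: nn.Module, is_quantized, dequantize) -> int:
    """Replace every "quantized" nn.Linear weight with a plain nn.Parameter.

    is_quantized and dequantize are injected (hypothetical) so this runs
    without torchao. Returns the number of layers that were replaced.
    """
    replaced = 0
    for module in model.modules():
        if isinstance(module, nn.Linear) and is_quantized(module.weight):
            module.weight = nn.Parameter(dequantize(module.weight.detach()))
            replaced += 1
    return replaced


def is_fp16(weight: torch.Tensor) -> bool:
    # Toy stand-in for isinstance(weight, TorchAOBaseTensor).
    return weight.dtype == torch.float16


model = nn.Sequential(nn.Linear(4, 4, bias=False).half(), nn.ReLU(), nn.Linear(4, 2))
print(dequantize_linear_layers(model, is_fp16, lambda w: w.float()))  # 1
# After replacement, the same check reports no quantized Linear layers.
print(any(is_fp16(m.weight) for m in model if isinstance(m, nn.Linear)))  # False
```

The second check mirrors the `_verify_if_layer_quantized` fix: detection is based on the weight's type, so once a weight has been swapped for a plain parameter the layer is no longer reported as quantized.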
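To make the float16-vs-bfloat16 dynamic-range argument concrete without a GPU, the sketch below rounds a single value into each format using only the standard library. It assumes float16 can be emulated with the struct `'e'` format and bfloat16 by keeping the top 16 bits of a float32 with round-to-nearest-even. It illustrates the ranges involved; it is not a reproduction of autocast or of the quantized backward pass.

```python
import struct


def round_to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses values beyond the float16 range; a PyTorch cast gives inf.
        return float("inf") if x > 0 else float("-inf")


def round_to_bfloat16(x: float) -> float:
    """Round x to bfloat16 (finite inputs within float32 range only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits = (bits >> 16) << 16            # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


tiny_grad = 1e-8  # below float16's smallest subnormal (~5.96e-8)
print(round_to_float16(tiny_grad))   # 0.0: the gradient is lost
print(round_to_bfloat16(tiny_grad))  # nonzero, about 1.0e-08
print(round_to_float16(7e4))         # inf: above float16's max of 65504
print(round_to_bfloat16(7e4))        # 70144.0: coarse, but finite
```

bfloat16 trades mantissa precision for float32's 8-bit exponent, so small gradients lose digits instead of vanishing, which is the trade-off behind switching the training test's autocast dtype.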
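The WanAnimate size arithmetic can be checked with a few lines of Python. Two assumptions here are inferred from the numbers in this PR rather than read from the model code: a 3D patch size of (1, 2, 2) for hidden_states, and face-encoder convolutions with kernel 3, stride 2, padding 1, so each one maps T frames to ceil(T/2). Both helper names are hypothetical.

```python
def patch_seq_len(frames: int, height: int, width: int, patch=(1, 2, 2)) -> int:
    """Number of tokens after 3D patchification (patch size is an assumption)."""
    pt, ph, pw = patch
    return (frames // pt) * (height // ph) * (width // pw)


def face_temporal_len(face_frames: int, num_stride2_convs: int = 2) -> int:
    """Temporal length after stride-2 convs (assumed kernel 3, padding 1), plus 1 padding frame."""
    t = face_frames
    for _ in range(num_stride2_convs):
        t = (t + 2 * 1 - 3) // 2 + 1  # standard conv output-length formula
    return t + 1


old_s = patch_seq_len(21, 64, 64)
new_s = patch_seq_len(5, 16, 16)
print(old_s, new_s)                                   # 21504 320
print(old_s**2 / new_s**2)                            # 4515.84: attention-matrix shrink factor
print(face_temporal_len(77), face_temporal_len(13))   # 21 5
```

Because the score matrix grows as S², the ~4516x reduction is roughly consistent with the stated drop from ~74 GiB to ~16 MB. The face helper reproduces both the old pairing (77 face frames for 21 latent frames) and the new one (13 for 5).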

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

jiqing-feng · 2026-04-21T05:59:56Z

Hi @sayakpaul, would you please review this PR? Thanks!


jiqing-feng added 7 commits April 20, 2026 18:55

enable int4wo tests on XPU

bef284c

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix typo

ca507a8

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix input dtype

c51708e

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix int4 config for xpu

6df4b31

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix format

8a9013d

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

only int4wo need specific format

81e7015

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix config name

4e4e759

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

github-actions bot added tests size/M PR with diff < 200 LOC labels Apr 21, 2026

fix dequantize and training

8180979

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

github-actions bot added quantization size/M PR with diff < 200 LOC and removed size/M PR with diff < 200 LOC labels Apr 21, 2026

jiqing-feng changed the title ~~Enable TorchAO int4 weight-only quantization tests on Intel XPU~~ Improve TorchAO quantization test coverage and XPU support Apr 21, 2026

fix test size

d210d4a

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

github-actions bot added size/M PR with diff < 200 LOC and removed size/M PR with diff < 200 LOC labels Apr 21, 2026

fix size

0ba8682

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

jiqing-feng wants to merge 10 commits into huggingface:main from jiqing-feng:torchao