DPTDecoder doesn't work properly with VIT using patch_size=14 #1328

@schwobr

Description

Hi,

I wanted to use DPT combined with a fine-tuned ViT-H/14 model on a specific subtask and got a runtime error when feeding a batch of 224x224 images to the model: `RuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 3`.

I did some further investigation and noticed that upsampling did not work as expected, my feature dimensions being computed as 8x8, 16x16, 16x16, 48x48. That's when I noticed that scale factors are computed with a strange formula: `scale_factors = [stride / 2 ** (i + 2) for i, stride in enumerate(encoder_output_strides)]`. For strides that are not powers of 2, this logically yields non-integer results, causing the unexpected behavior.
The problem is that it is impossible to go from 16x16 patches to 1/32 of the input size (which is 7 here) using strides. Maybe scale factors should be computed using the power of 2 closest to each stride? For instance:
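For concreteness, the computation can be reproduced in isolation (assuming a plain ViT where every stage reports the same output stride, equal to the patch size; `encoder_output_strides` is a stand-in for the values the decoder receives):

```python
# Reproduce the DPTDecoder scale-factor formula for a ViT with patch_size=14.
# All four stages of a plain ViT have the same output stride: the patch size.
encoder_output_strides = [14, 14, 14, 14]

scale_factors = [
    stride / 2 ** (i + 2) for i, stride in enumerate(encoder_output_strides)
]
print(scale_factors)  # [3.5, 1.75, 0.875, 0.4375] -- none are integers
```

With a patch size of 16 the same formula gives [4.0, 2.0, 1.0, 0.5], which is why the bug only surfaces for non-power-of-2 patch sizes.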

import numpy as np

# Round each stride to the nearest power of 2, then derive integer-friendly factors.
approx_strides = [2 ** np.round(np.log2(stride)) for stride in encoder_output_strides]
scale_factors = [stride / 2 ** (i + 2) for i, stride in enumerate(approx_strides)]

Or just add a special case when the patch size is 14, as it is the only exception for ViT. I can write a PR depending on the preferred option.

EDIT: With this solution the output also needs to be resized to match the input size, as it reaches 256x256 at the end of the segmentation head instead of 224x224. Maybe upsampling to the input size, instead of using a fixed upsampling factor in the segmentation head, is a solution.
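A minimal sketch of the rounding workaround (names hypothetical; `encoder_output_strides` again stands in for the per-stage strides the decoder receives):

```python
import numpy as np

# Plain ViT-H/14: every stage's output stride equals the patch size.
encoder_output_strides = [14, 14, 14, 14]

# Round each stride to the nearest power of 2 (14 -> 16), then compute
# scale factors from the rounded values so they come out as clean halvings.
approx_strides = [2 ** int(np.round(np.log2(s))) for s in encoder_output_strides]
scale_factors = [s / 2 ** (i + 2) for i, s in enumerate(approx_strides)]
print(approx_strides)  # [16, 16, 16, 16]
print(scale_factors)   # [4.0, 2.0, 1.0, 0.5]
```

Since 16 != 14, the decoder then behaves as if the input were 256x256 rather than 224x224, which is why the head output would need a final resize back to the input size (e.g. with `torch.nn.functional.interpolate(out, size=input_size, mode="bilinear")`).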
