DPTDecoder doesn't work properly with VIT using patch_size=14 #1328

@schwobr

Description

Hi,

I wanted to use DPT combined with a fine-tuned ViT-H/14 model on a specific subtask and got a runtime error when feeding a batch of 224x224 images to the model: `RuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 3`.

I did some further investigation and noticed that upsampling did not work as expected, my feature dimensions being computed as 8x8, 16x16, 16x16, 48x48. That's when I noticed that scale factors are computed with a strange formula: `scale_factors = [stride / 2 ** (i + 2) for i, stride in enumerate(encoder_output_strides)]`. For strides that are not powers of 2, this logically yields non-integer results, causing the unexpected behavior.
The problem is that it is impossible to go from 16x16 patches to 1/32 of the input size (which is 7 here) using strides. Maybe scale factors should be computed using the power of 2 closest to each stride? For instance:
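For concreteness, the computation can be reproduced in isolation (assuming a plain ViT where every stage reports the same output stride, equal to the patch size; `encoder_output_strides` is a stand-in for the values the decoder receives):

```python
# Reproduce the DPTDecoder scale-factor formula for a ViT with patch_size=14.
# All four stages of a plain ViT have the same output stride: the patch size.
encoder_output_strides = [14, 14, 14, 14]

scale_factors = [
    stride / 2 ** (i + 2) for i, stride in enumerate(encoder_output_strides)
]
print(scale_factors)  # [3.5, 1.75, 0.875, 0.4375] -- none are integers
```

With a patch size of 16 the same formula gives [4.0, 2.0, 1.0, 0.5], which is why the bug only surfaces for non-power-of-2 patch sizes.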

import numpy as np

# Round each stride to the nearest power of 2, then derive integer-friendly factors.
approx_strides = [2 ** np.round(np.log2(stride)) for stride in encoder_output_strides]
scale_factors = [stride / 2 ** (i + 2) for i, stride in enumerate(approx_strides)]

Or just add a special case when the patch size is 14, as it is the only exception for ViT. I can write a PR depending on the preferred option.

EDIT: With this solution the output also needs to be resized to match the input size, as it reaches 256x256 at the end of the segmentation head instead of 224x224. Maybe upsampling to the input size, instead of using a fixed upsampling factor in the segmentation head, is a solution.
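A minimal sketch of the rounding workaround (names hypothetical; `encoder_output_strides` again stands in for the per-stage strides the decoder receives):

```python
import numpy as np

# Plain ViT-H/14: every stage's output stride equals the patch size.
encoder_output_strides = [14, 14, 14, 14]

# Round each stride to the nearest power of 2 (14 -> 16), then compute
# scale factors from the rounded values so they come out as clean halvings.
approx_strides = [2 ** int(np.round(np.log2(s))) for s in encoder_output_strides]
scale_factors = [s / 2 ** (i + 2) for i, s in enumerate(approx_strides)]
print(approx_strides)  # [16, 16, 16, 16]
print(scale_factors)   # [4.0, 2.0, 1.0, 0.5]
```

Since 16 != 14, the decoder then behaves as if the input were 256x256 rather than 224x224, which is why the head output would need a final resize back to the input size (e.g. with `torch.nn.functional.interpolate(out, size=input_size, mode="bilinear")`).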
