Performance decrease compared to base SigLIP model without NaViT

#7 opened by Helder5788

Hello,
congratulations on your work with Idefics!

I was evaluating this model, with its NaViT implementation, on the FOIL dataset, and I got much lower accuracy (CLIPScore) in detecting a foil caption when using the Idefics2ImageProcessor (longest_edge=980) - A - than when using the model with the SiglipImageProcessor (without padding, resizing to 384x384 px) - B -.

With the siglip-so400m-14-384-flash-attn2 model, I get accuracy similar to setup B.

For the text caption features, I used the default SiglipTokenizer.
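Concretely, the two preprocessing setups looked roughly like this (a sketch; the image path and captions are placeholders, and I construct the processors directly with explicit sizes rather than loading them from the exact checkpoints):

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, Idefics2ImageProcessor, SiglipImageProcessor

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
captions = ["a cat on a mat", "a dog on a mat"]   # original caption vs. foil

# Text features: default SigLIP tokenizer, padded to the model's fixed text length
tokenizer = AutoTokenizer.from_pretrained("google/siglip-so400m-patch14-384")
text_inputs = tokenizer(captions, padding="max_length", return_tensors="pt")

# Setup A: Idefics2-style preprocessing; keeps the aspect ratio, resizes so the
# longest edge is 980, pads, and returns a pixel-level padding mask
processor_a = Idefics2ImageProcessor(
    do_image_splitting=False,
    size={"shortest_edge": 378, "longest_edge": 980},
)
inputs_a = processor_a(images=image, return_tensors="pt")
# inputs_a has "pixel_values" and "pixel_attention_mask" marking real vs. padded pixels

# Setup B: vanilla SigLIP preprocessing; resizes the image to a fixed 384x384
processor_b = SiglipImageProcessor(size={"height": 384, "width": 384})
inputs_b = processor_b(images=image, return_tensors="pt")
# inputs_b has "pixel_values" only; there is no padding, so no mask is needed
```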

Do you have any insight into why? Is it because this model has not yet been trained with NaViT?

Thanks,

HuggingFaceM4 org

So it seems that you're using the vision encoder of Idefics2 and comparing it to the original version of SigLIP (siglip-so400m-14-384-flash-attn2), right?

In this case, yes, preserving the aspect ratio and allowing a bigger resolution is one difference, but note that we also trained the vision encoder, so the weights are quite different from those of the original SigLIP.
Therefore, our weights may be optimized for use with a downstream LM rather than as a standalone model (or using them standalone could require changes in the way the model is evaluated).

HuggingFaceM4 org

Or are you using the model uploaded on this repo?

If the accuracy is low, that's expected.
To increase the resolution to 980x980, we added newly initialized positional embeddings.
They are therefore untrained, and the model uploaded to this repo has not seen any images with a resolution higher than 384x384 during its training.
These new embeddings need to be trained to obtain good accuracy.
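For reference, a common alternative to newly initializing the larger table is to bicubically interpolate the pretrained 27x27 grid of position embeddings (384 // 14 patches per side) up to 70x70 (980 // 14). A minimal sketch of that recipe, not what was done for this repo:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor,
                          old_grid: int = 27, new_grid: int = 70) -> torch.Tensor:
    """Resample a (old_grid**2, dim) position-embedding table to (new_grid**2, dim).
    27 = 384 // 14 patches per side; 70 = 980 // 14."""
    dim = pos_embed.shape[-1]
    # (N, dim) -> (1, dim, grid, grid) so F.interpolate can treat it like an image
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)
```

Interpolated embeddings still benefit from some fine-tuning at the target resolution, but they usually start from a much better place than random initialization.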

Hi,
I used the model in this repo and evaluated it on higher-resolution images (keeping the aspect ratio) - A -, and compared the performance against the scenario of resizing (and center-cropping) the images to the native model resolution of 384x384 - B -.

In - A -, I only used the Idefics2ImageProcessor for padding the image and generating the patch attention mask. As for the Idefics2 image encoder, it would not be suitable in my evaluation scenario, since it is trained with a different objective (you do not apply the CLIP contrastive loss), so I would expect the CLIPScores to be less optimized.
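For context, the patch attention mask can be derived from the pixel-level padding mask roughly the way the Idefics2 modeling code does it; a minimal sketch, assuming a patch size of 14:

```python
import torch

def pixel_to_patch_mask(pixel_attention_mask: torch.Tensor,
                        patch_size: int = 14) -> torch.Tensor:
    """Collapse a (batch, H, W) pixel padding mask to a (batch, H/p, W/p) boolean
    mask: a patch is kept if it contains at least one real (non-padded) pixel."""
    patches = (pixel_attention_mask
               .unfold(1, patch_size, patch_size)   # (batch, H/p, W, p)
               .unfold(2, patch_size, patch_size))  # (batch, H/p, W/p, p, p)
    return patches.sum(dim=(-1, -2)) > 0
```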

If I understand your comments correctly, the model in this repo has not been trained after the positional embedding extension (interpolation) or with the NaViT approach.

Thanks for the feedback!

Helder5788 changed discussion status to closed
HuggingFaceM4 org

Yes, the model in this repo should be trained to adapt at least the new positional embeddings (so that they are not random). After that, you should get similar or better performance.
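A minimal sketch of what that adaptation could look like, freezing everything except the position embeddings before a short fine-tune (the parameter-name match is illustrative and depends on the checkpoint's modeling code):

```python
import torch

def freeze_all_but_pos_embed(model: torch.nn.Module) -> None:
    """Freeze every parameter except the vision position embeddings, so a short
    fine-tune only adapts the newly initialized entries."""
    for name, param in model.named_parameters():
        # "position_embedding" is the usual substring in SigLIP-style models,
        # but verify it against your checkpoint's parameter names
        param.requires_grad = "position_embedding" in name
```

After freezing, train briefly with your usual objective (e.g. the SigLIP sigmoid contrastive loss) on images at the new resolution.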
