CLIP Interrogation Captioning
This script allows CLIP ViT-L (the base CLIP used in most models) to interrogate an image against a custom vocabulary.
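The core of the idea can be sketched as follows: embed the image and every vocabulary token with CLIP, then rank the tokens by cosine similarity to the image embedding. This is a minimal illustration, not the script itself; the `top_tokens` helper and the model-loading lines (via Hugging Face `transformers`) are assumptions about how such a pipeline is typically wired up.

```python
import numpy as np

def top_tokens(image_feat, vocab_feats, vocab, k=10):
    """Rank a custom vocabulary by cosine similarity to one image embedding."""
    img = image_feat / np.linalg.norm(image_feat)
    voc = vocab_feats / np.linalg.norm(vocab_feats, axis=1, keepdims=True)
    sims = voc @ img                    # cosine similarity per vocabulary token
    order = np.argsort(sims)[::-1][:k]  # highest similarity first
    return [(vocab[i], float(sims[i])) for i in order]

# With real CLIP ViT-L features (requires a model download; illustrative only):
# from transformers import CLIPModel, CLIPProcessor
# model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
# processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# inputs = processor(text=vocab, images=image, return_tensors="pt", padding=True)
# image_feat = model.get_image_features(pixel_values=inputs["pixel_values"])[0].detach().numpy()
# vocab_feats = model.get_text_features(input_ids=inputs["input_ids"],
#                                       attention_mask=inputs["attention_mask"]).detach().numpy()
```

Because both embedding sets are L2-normalized, the dot product is exactly the cosine similarity CLIP was trained to maximize for matching image/text pairs.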
For example, this image (which was likely in the training database) was given the tags:
flamenco, skirt, tango, cropped, heidel, pant, crop, hula, valedic, skirts
While not intended as a standalone captioning tool, this gives you a glimpse of what the text encoder "sees".
It can also expose the limitations of the CLIP model, such as a Black female celebrity being captioned as "oprah", or a smiling subject scoring highly on dental- and teeth-related words.
Primarily I use this as a tool to identify where CLIP-L is lacking in training data and how to adjust accordingly.
As one of a handful of people who have fine-tuned a CLIP-L, I can tell you that it is far more difficult than fine-tuning a diffusion model.
Limitations
The batch size is set to 128 tokens, which is optimal for 8 GB cards (increase it if you have more VRAM).
The script center-crops and resizes the image to 224x224 before computing cosine similarity, so "cropped" frequently appears as a token (remove it from filtered_output.txt if not desired).
The script saves the top 10 tokens ranked by cosine similarity (adjust if needed).
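The preprocessing and batching described above can be sketched like this. It is a simplified stand-in, assuming Pillow for image handling and NumPy arrays in place of real CLIP embeddings; the function names and defaults (`center_crop_resize`, `batched_scores`, `BATCH`, `TOP_K`) are illustrative, not taken from the script.

```python
import numpy as np
from PIL import Image

BATCH = 128   # vocabulary tokens scored per batch; raise on larger VRAM
TOP_K = 10    # tokens kept per image
SIZE = 224    # CLIP ViT-L input resolution

def center_crop_resize(img: Image.Image, size: int = SIZE) -> Image.Image:
    """Square center crop, then resize to the CLIP input resolution."""
    w, h = img.size
    s = min(w, h)
    left, top = (w - s) // 2, (h - s) // 2
    return img.crop((left, top, left + s, top + s)).resize((size, size))

def batched_scores(image_feat, vocab_feats, batch=BATCH):
    """Cosine similarity against the vocabulary, one batch at a time."""
    img = image_feat / np.linalg.norm(image_feat)
    out = []
    for i in range(0, len(vocab_feats), batch):
        chunk = vocab_feats[i:i + batch]
        chunk = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
        out.append(chunk @ img)
    return np.concatenate(out)
```

Batching only bounds peak memory; the scores are identical to computing the full similarity matrix in one pass, so the top-10 selection is unaffected by the batch size.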


