
CLIP Interrogation Captioning Script (48k Vocabulary)


Apr 6, 2025



CLIP Interrogation Captioning

This script allows CLIP-ViT-L (the base CLIP used in most models) to interrogate an image against a custom vocabulary.

For example, an image that was likely in the training database was given the tags:

  • flamenco, skirt, tango, cropped, heidel, pant, crop, hula, valedic, skirts

While not intended as a standalone captioning tool, this can give you a glimpse of what the text encoder "sees".

It can also be useful for exposing the limitations of the CLIP model, such as a Black female celebrity being captioned with oprah, or a smiling person producing a high likelihood of dental- and teeth-related words.

Primarily, I use this as a tool to identify where CLIP-L is lacking training data and how to adjust accordingly.

As one of a handful of people who have fine-tuned a CLIP-L, I can tell you that it is far more difficult than fine-tuning a diffusion model.
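At its core, the interrogation is a cosine-similarity ranking: every vocabulary entry is embedded with CLIP's text encoder, the image is embedded with the image encoder, and the tokens with the highest similarity win. A minimal sketch of that ranking step is below; the vocabulary list and the random embeddings are stand-ins for the real 48k-entry vocabulary and the actual CLIP-ViT-L outputs (768-dimensional for ViT-L):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the real script, text_emb holds one CLIP text
# embedding per vocabulary entry (48k x 768) and image_emb is the CLIP
# image embedding of the interrogated picture.
vocab = ["flamenco", "skirt", "tango", "cropped", "hula"]
text_emb = rng.normal(size=(len(vocab), 768))
image_emb = rng.normal(size=768)

# Cosine similarity = dot product of L2-normalized vectors.
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb /= np.linalg.norm(image_emb)
scores = text_emb @ image_emb

# Keep the best-matching tokens (the actual script saves the top 10).
k = 3
top = np.argsort(scores)[::-1][:k]
for i in top:
    print(vocab[i], round(float(scores[i]), 4))
```

In the real script the text embeddings would be computed in batches (see the batch-size note below) rather than all at once, but the ranking itself is exactly this dot product over normalized vectors.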

Limitations

  • The batch size is set to 128 tokens, which is optimal for 8 GB cards (increase it for larger VRAM)

  • The script center-crops and resizes the image to 224x224 before cosine similarity is computed, so cropped frequently appears as a token (remove it from filtered_output.txt if not desired)

  • The script saves the top 10 tokens ranked by cosine similarity (adjust if needed)
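The 224x224 center-crop-and-resize mentioned above can be approximated as follows. This is a dependency-free sketch using nearest-neighbor resampling; the actual script would more likely use CLIP's own preprocessor (e.g. torchvision or PIL), which also normalizes pixel values:

```python
import numpy as np

def center_crop_resize(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Center-crop to the largest square, then nearest-neighbor resize
    to size x size. A crude stand-in for CLIP's crop+resize step."""
    h, w = img.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = img[top:top + side, left:left + side]
    idx = (np.arange(size) * side) // size  # nearest-neighbor sample grid
    return square[idx][:, idx]

# A 300x500 RGB image becomes 224x224 after preprocessing.
img = np.zeros((300, 500, 3), dtype=np.uint8)
print(center_crop_resize(img).shape)  # (224, 224, 3)
```

Because the crop discards the edges of non-square images, the encoder genuinely "sees" a cropped picture, which is why cropped so often scores highly.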
