A simple tool for generating image captions in batches using large language models (LLMs).

Some time ago, I needed to create captions for a large collection of images to use in LoRA training and fine-tuning machine learning models. At first, I experimented with various tools and workflows. One of the early options I explored was "Florence2" in a ComfyUI workflow. While functional, it quickly became apparent that it lacked the flexibility my use case required, and certain restrictions made it less suitable for the level of customization I needed.

My search continued, and I came across the Joy Caption project, which offered significant improvements and demonstrated great potential. However, despite its strengths, the workflow still felt somewhat cumbersome and rigid, making it difficult to fully adapt to my needs.

Determined to find the perfect solution, I delved deeper into the available options, researching various iterations of Python scripts and applications designed to perform similar tasks. Each tool offered unique features and approaches, but none provided a complete solution that aligned with my vision. I found myself repeatedly wishing for a tool that combined the best aspects of these different solutions into one cohesive, flexible, and efficient package.

This realization led me to take matters into my own hands. Drawing on the strengths of the tools I had explored, I developed a custom solution that does exactly what I need. This script is the result of that research and practical testing. It streamlines the process of generating captions for images and lets users adapt it to their own prompts.

The tool is a Python CLI script that generates caption files for all images within a specified folder. It saves each caption under the same filename as the corresponding image, with a .txt extension, either in the same folder or in the directory given by the output_dir argument. The script skips any image that already has a corresponding caption file in the output_dir.
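The naming and skip rules described above can be sketched in a few lines of Python. This is only an illustration, not the actual code in caption.py: the function names caption_path_for and images_needing_captions are hypothetical, and the set of image extensions is an assumption, since the README does not say which formats the script accepts.

```python
from pathlib import Path
from typing import List, Optional

# Assumption: the real script may recognize a different set of extensions.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def caption_path_for(image_path: Path, output_dir: Optional[Path] = None) -> Path:
    """Return the .txt path paired with an image (hypothetical helper).

    The caption keeps the image's base name and goes either next to the
    image or into output_dir when one is given.
    """
    target_dir = output_dir if output_dir is not None else image_path.parent
    return target_dir / (image_path.stem + ".txt")


def images_needing_captions(input_dir: Path, output_dir: Optional[Path] = None) -> List[Path]:
    """List images in input_dir that do not yet have a caption file."""
    pending = []
    for path in sorted(input_dir.iterdir()):
        # Ignore subfolders and anything that is not a known image type.
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        # Skip images whose caption file already exists.
        if caption_path_for(path, output_dir).exists():
            continue
        pending.append(path)
    return pending
```

Calling images_needing_captions(Path("./test")) would return only the images in ./test that still lack a matching .txt file, which is what makes an interrupted batch safe to resume.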

https://huggingface.co/alcaitiff/LLM-CAPTION

The installation is very simple:

Clone the repository

git clone https://huggingface.co/alcaitiff/LLM-CAPTION

Enter the directory

cd LLM-CAPTION

Create and activate a virtual environment

python3 -m venv ./venv

source venv/bin/activate

Download the dependencies

pip install -r requirements.txt

Dependencies

Google SigLIP (3.5GB) will be downloaded automatically from https://huggingface.co/google/siglip-so400m-patch14-384
Uncensored Lexi Llama-3.1-8B-Instruct (5.5GB) will be downloaded automatically from https://huggingface.co/John6666/Llama-3.1-8B-Lexi-Uncensored-V2-nf4
The Joy Caption model is in the checkpoint folder

Usage

#EX1
python3 ./caption.py ./test 

#EX2
python3 ./caption.py ./test \
--prompt "Describe this image in detail within 50 words." \
--output_dir /tmp/caption
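For readers curious how a command line like the two examples above could be parsed, here is a minimal argparse sketch. It is an assumption about the interface's shape, inferred only from the usage examples (one positional folder, plus --prompt and --output_dir), and is not taken from caption.py. The names build_parser and input_dir are hypothetical.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Generate caption files for a folder of images."
    )
    # Positional argument: the folder that holds the images.
    parser.add_argument("input_dir", type=Path, help="folder containing the images")
    # None here stands for "fall back to the script's default prompt".
    parser.add_argument("--prompt", default=None, help="custom captioning prompt")
    # None here stands for "write captions next to the images".
    parser.add_argument("--output_dir", type=Path, default=None,
                        help="folder where the .txt captions are written")
    return parser


# Parse the same arguments as the second usage example.
args = build_parser().parse_args([
    "./test",
    "--prompt", "Describe this image in detail within 50 words.",
    "--output_dir", "/tmp/caption",
])
print(args.input_dir, args.output_dir)
```

Keeping both options optional matches the first example, where only the folder is passed and the defaults apply.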

Default prompt

In one paragraph, write a very descriptive caption for this image, describe all objects, characters and their actions, describe in detail what is happening and their emotions. Include information about lighting, the style of this image and information about camera angle within 200 words. Don't create any title for the image.

Example

Prompt used

In one paragraph write a very descriptive caption for this image, describe all objects, characters and their actions, be explicit and detailed, describe in detail what is happening and their emotions. Include information about lighting, the style of this image and information about camera angle within 150 words.

Result

In this whimsical cartoon, a serene sunset unfolds on the horizon as a bald, bespectacled man sits on a worn, wooden bench, gazing out at the breathtaking view. His arms are crossed, and his face is a picture of dejection, as if he's been pondering life's mysteries for hours. Beside him, perched on the edge of the bench, is a wise, big-eyed owl, its feathers a mesmerizing mosaic of browns and grays. The owl's head is cocked to one side, as if listening intently to the man's woes. The sky above is ablaze with hues of orange, pink, and purple, with a few wispy clouds drifting lazily across the canvas. The sun, a burning orange disk, slowly dips below the horizon, casting a warm, golden light over the scene. The man's speech bubbles read, "Why is happiness so elusive?" and "Because you look for it in the wrong places, including the future instead of the present." The owl's response, "Because you look for it in the wrong places, because you look for it in the future instead of the present," is a poignant reminder to live in the moment. The style of this image is reminiscent of classic cartoons from the 1950s and 60s, with bold lines, vibrant colors, and a sense of nostalgia. The camera angle is a wide, sweeping shot, taking in the entire scene from a low point

Note that you can change the prompt however you like, and different use cases may call for different prompts. This tool may also be useful for detecting poisoned images, but that is another story.