SD3.5L and M Don't Like Long Prompts (Partially)
Introduction
(Preface: I used an LLM to rewrite my ramblings so it's easier to read. TL;DR: workflow in attachments.)
The focus here is on SD3.5L and M and their ability to handle multiple prompts. I've always been intrigued by their predecessor SDXL and its dual-CLIP system (CLIPG/CLIPL), which allowed for multiple prompts. With the release of SD3, that interest resurfaced, now enhanced with T5. Despite some decent outputs, my experiments with SD3 didn't yield great results, especially with human figures (who didn't have that problem?). The release of SD3.5 brought new hope, but I soon noticed that longer T5 prompts quickly degraded image quality. That led me to explore solutions.
(Using this workflow I get results 1, results 2, results 3, and results 4.)
The Problem
SD3.5L and M are smart, but they struggle with long T5 prompts.
Image quality drops fast when T5 prompts get too elaborate, which defeats the purpose of having T5 in the first place.
I hypothesized that the issue lies with CLIPG and CLIPL, which are designed for a maximum of 75 tokens and don't handle long T5-style prompts well.
The Solution
To address this, I used an LLM (Large Language Model) to:
Separate the prompt parts: Break down long T5 prompts into concise segments for CLIPG and CLIPL.
Enhance the prompt: Use the LLM to enrich the input text, or reinterpret it to fit SD3.5's constraints, using JSON as the output format.
Art style selection: Add an option to choose an art style for the final output.
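To illustrate the idea, here is a minimal sketch of what the LLM's structured output could look like and how it might be split up afterwards. The field names (`t5_prompt`, `clip_g_prompt`, etc.) are my own convention for this example, not a fixed spec from the workflow:

```python
import json

# Hypothetical example of the raw JSON the LLM is nudged to produce:
# one long, elaborate prompt for T5; short, concise ones for CLIPG/CLIPL;
# plus the selected art style.
raw_llm_output = """
{
  "t5_prompt": "A weathered lighthouse keeper stands on a rocky cliff at dusk, wind tugging at his coat, waves crashing far below",
  "clip_g_prompt": "lighthouse keeper, rocky cliff, dusk, stormy sea",
  "clip_l_prompt": "lighthouse keeper at dusk",
  "art_style": "oil painting"
}
"""

def split_prompts(raw: str) -> dict:
    """Parse the LLM's raw JSON output. If the model failed to emit
    valid JSON, fall back to routing the whole text to every encoder."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"t5_prompt": raw, "clip_g_prompt": raw,
                "clip_l_prompt": raw, "art_style": ""}

prompts = split_prompts(raw_llm_output)
print(prompts["clip_l_prompt"])  # short prompt routed to CLIPL
```

The fallback matters in practice: smaller local models occasionally wrap their JSON in prose, and silently feeding broken output to the encoders is worse than just duplicating the original text.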
Implementation
LLM Prompting: I nudged the LLM toward generating raw JSON, which allows for structured prompt separation.
Nodes:
Take the raw text input and send it to Ollama. Set the Ollama keep-alive timer to 0, so the LLM is unloaded immediately after generating the prompt.
Either enhance the image prompt or reinterpret it to maintain SD3.5 constraints.
Ensure T5 gets a long, elaborate prompt, while CLIPG/CLIPL get short, concise ones.
After generating the image, unload the SD3.5 model so Ollama can use all the VRAM for LLM inference again.
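The Ollama side of the steps above can be sketched roughly as follows, assuming a local Ollama server and its standard `/api/generate` endpoint. The model name and the system prompt wording are placeholders of my own; `format: "json"` and `keep_alive: 0` are standard Ollama request options:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(user_text: str, model: str = "llama3.1") -> dict:
    """Build an Ollama request that asks for structured JSON and sets
    keep_alive to 0, so the LLM is unloaded right after responding and
    the VRAM is free for SD3.5."""
    system = ("Rewrite the user's idea as JSON with keys t5_prompt "
              "(long and elaborate), clip_g_prompt and clip_l_prompt "
              "(short, under 75 tokens), and art_style. "
              "Output raw JSON only, no commentary.")
    return {
        "model": model,
        "prompt": f"{system}\n\nUser idea: {user_text}",
        "format": "json",   # constrain the model to emit valid JSON
        "stream": False,
        "keep_alive": 0,    # unload the model immediately after the response
    }

def generate_prompts(user_text: str) -> dict:
    """Send the request and return the LLM's parsed JSON payload."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(user_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return json.loads(body["response"])
```

In ComfyUI this whole round trip is handled by the Ollama node; the sketch just shows which knobs matter, namely the JSON output constraint and the zero keep-alive.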
Requirements
Ollama, with the biggest model your GPU can reasonably fit; you're going to need a context size of ~2000-4000 tokens.
ClownShark Sampler. I've been telling its author to put it in the ComfyUI Manager, but it still isn't there. One of the better samplers out there, if you like tinkering.
The rest of the nodes should all be available from the ComfyUI manager itself.
LLM Differences
Prompt Enrichment: Lighter LLM models can be used to enrich and separate prompts, but only up to a point.
Dynamic Inputs: Larger or reasoning-capable LLMs can process complex inputs like lyrics or chat logs to generate prompts.
