LTX-2.3 Image-to-Video Workflow — QwenVL Auto-Prompt, No Drift

Name: LTX-2.3 Image-to-Video Workflow — QwenVL Auto-Prompt, No Drift
Rating: 5 (8 reviews)
Author: TP_AI_63
Updated: Jun 25, 2026
Tags: tool, ltx-video, stock-footage, gguf, i2v, workflow
Type: Workflows
Reviews: Positive (8)
Published: Jun 25, 2026
Base Model: LTXV 2.3
Hash (AutoV2): 5F070C3A3A
License: LTXV2
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🎬 **LTX-2.3 Image-to-Video Workflow**
QwenVL Auto-Prompt · No Drift · ComfyUI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

A pure LTX 2.3 22B image-to-video pipeline for ComfyUI: drop in an image and get a motion clip back. A QwenVL vision model analyzes the input image and generates a motion-aware prompt, so no manual description is needed. The workflow enforces a locked static camera (anti-drift), scales any input resolution to a fixed pixel budget, and upscales the output to 1920×1088 at 24 FPS. It is intended for stock footage, ambient loops, and commercial video generation.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✨ **Features**
✅ QwenVL Auto Motion Director — Vision model reads input image → auto-generates motion prompt with camera lock and object tracking hints
✅ Locked Static Camera — Zero pan, zoom, or drift; all motion in-frame only
✅ Pure LTX 2.3 22B — No LoRA needed; GGUF quantization for 16GB VRAM
✅ Dynamic Pixel Scaling — Auto-scales any input size to optimal 0.52MP for 8-step inference
✅ Dual-Stage Upscale — 960×544 base → 2× spatial upscaler → 1920×1088 output
✅ Audio + Video VAE — Multi-modal encoding; ready for synced audio pipelines
✅ 24 FPS Native — Smooth playback; 168 frames per generation
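
The resolution and timing figures in the feature list fit together arithmetically, and a short sketch makes that visible. This is not the workflow's actual node logic: it assumes 1 MP = 1,000,000 pixels and that each side is snapped to the nearest multiple of 32 (both are assumptions), and the function name `fit_to_pixel_budget` is hypothetical.

```python
import math

def fit_to_pixel_budget(width, height, megapixels=0.52, multiple=32):
    """Scale (width, height) to roughly `megapixels`, keeping aspect ratio.

    Assumptions (not taken from the workflow): 1 MP = 1,000,000 pixels and
    each side is rounded to the nearest multiple of `multiple`.
    """
    scale = math.sqrt(megapixels * 1_000_000 / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

base_w, base_h = fit_to_pixel_budget(1920, 1080)  # a 16:9 input image
final_w, final_h = base_w * 2, base_h * 2         # 2x spatial upscaler
duration_s = 168 / 24                             # frames / FPS

print(base_w, base_h)    # 960 544
print(final_w, final_h)  # 1920 1088
print(duration_s)        # 7.0
```

Under these assumptions a 1920×1080 input lands on the 960×544 base size quoted above, and 168 frames at 24 FPS gives a 7-second clip. Other aspect ratios will produce different base sizes at the same pixel budget.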

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📦 **Required Models** (6 files, ~32 GB)

• LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf (17.8 GB) — Main UNet diffusion model (GGUF Q4 quantized)
• gemma_3_12B_it_fp4_mixed.safetensors (9.45 GB) — Text encoder for LTX prompt understanding
• ltx-2.3_text_projection_bf16.safetensors (2.31 GB) — Text-to-latent projection layer
• LTX23_video_vae_bf16.safetensors (1.45 GB) — Video VAE codec (encode/decode video frames)
• LTX23_audio_vae_bf16.safetensors (365 MB) — Audio VAE codec (dual-modal support)
• ltx-2.3-spatial-upscaler-x2-1.1.safetensors (996 MB) — 2× spatial upscaler for final quality pass

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⬇️ **Download Links** (verified HuggingFace)

📁 **ComfyUI/models/unet/**
• LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf (17.8 GB) — https://huggingface.co/QuantStack/LTX-2.3-GGUF

📁 **ComfyUI/models/text_encoders/**
• gemma_3_12B_it_fp4_mixed.safetensors (9.45 GB) — https://huggingface.co/Comfy-Org/ltx-2
• ltx-2.3_text_projection_bf16.safetensors (2.31 GB) — https://huggingface.co/Kijai/LTX2.3_comfy

📁 **ComfyUI/models/vae/**
• LTX23_video_vae_bf16.safetensors (1.45 GB) — https://huggingface.co/Kijai/LTX2.3_comfy
• LTX23_audio_vae_bf16.safetensors (365 MB) — https://huggingface.co/Kijai/LTX2.3_comfy

📁 **ComfyUI/models/upscale_models/**
• ltx-2.3-spatial-upscaler-x2-1.1.safetensors (996 MB) — https://huggingface.co/Lightricks/LTX-2.3

⚠️ *The VAE files are NOT in the official Lightricks repo; get them from Kijai/LTX2.3_comfy. The Gemma fp4 encoder is hosted by Comfy-Org. Filenames use v1.1 (current stable hotfix release).*
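
Before loading the workflow, it can help to confirm that every file sits in the folder listed above. The following is a minimal sketch that only checks for file presence. The folder and file names are copied from the download list, while the `"ComfyUI"` root path and the helper name `missing_models` are assumptions for illustration.

```python
from pathlib import Path

# Folder -> filenames, copied from the download list above.
EXPECTED_FILES = {
    "unet": ["LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf"],
    "text_encoders": [
        "gemma_3_12B_it_fp4_mixed.safetensors",
        "ltx-2.3_text_projection_bf16.safetensors",
    ],
    "vae": [
        "LTX23_video_vae_bf16.safetensors",
        "LTX23_audio_vae_bf16.safetensors",
    ],
    "upscale_models": ["ltx-2.3-spatial-upscaler-x2-1.1.safetensors"],
}

def missing_models(comfy_root):
    """Return the expected model paths that do not exist under comfy_root/models."""
    models_dir = Path(comfy_root) / "models"
    return [
        models_dir / folder / name
        for folder, names in EXPECTED_FILES.items()
        for name in names
        if not (models_dir / folder / name).is_file()
    ]

if __name__ == "__main__":
    # "ComfyUI" is an assumed install location; point this at your own.
    for path in missing_models("ComfyUI"):
        print("missing:", path)
```

An empty result means all six files were found. The check does not verify file sizes or hashes.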

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🧩 **Required Custom Nodes**

• LTXV — Lightricks LTX-Video extension (sampling, encoding, projection)
• AILab_QwenVL_Advanced — QwenVL vision model integration for image-to-text
• ComfyUI-GGUF — UnetLoaderGGUF for quantized model loading
• VideoHelperSuite — VHS_VideoCombine, frame batching, video output export
• rgthree-comfy — Fast Groups Bypasser (optional; used for workflow flexibility)
• ImageIterator — Batch image loader for multi-image workflows
• ImageScaleToTotalPixels — Dynamic resolution scaling to pixel budget
• GetImageSize+ — Image dimension detection for auto-scaling pipeline

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🚀 **How to Use**

1. Place your input image(s) in the ComfyUI ./input directory
2. Load this workflow into ComfyUI
3. (Optional) Review the auto-generated motion prompt in the QwenVL output text node
4. Queue and generate
5. The output video is saved by VideoHelperSuite (VHS_VideoCombine) to the ComfyUI ./output directory

The motion prompt generation and scaling pipeline runs automatically: queue once, get your result.
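
For unattended runs, the queue step can also be scripted against ComfyUI's HTTP interface instead of clicking Queue in the browser. The sketch below assumes a local server on the default address `http://127.0.0.1:8188` and a workflow exported in API format (the regular UI save will not work here). The filename in the usage comment is hypothetical, and the helper names are illustrative.

```python
import json
import urllib.request

def build_queue_request(workflow, server="http://127.0.0.1:8188"):
    """Build a POST request for ComfyUI's /prompt endpoint.

    `workflow` must be the API-format export of the graph, not the
    regular UI save. The server address is an assumed local default.
    """
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def queue_workflow(path, server="http://127.0.0.1:8188"):
    """Load an API-format workflow JSON from `path` and queue it once."""
    with open(path, "r", encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_queue_request(workflow, server)) as resp:
        return json.loads(resp.read())

# Usage (hypothetical filename; requires a running ComfyUI server):
# result = queue_workflow("ltx23_i2v_api.json")
# print(result)
```

Building the request separately from sending it keeps the payload easy to inspect before anything reaches the server.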

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚙️ **Settings & Parameters**

• FPS — 24 (Standard frame rate; 168 total frames per generation)
• Pixel Budget — 0.52 MP (Optimal for 8-step sampling on 16GB VRAM)
• Sampler — er_sde (Low-drift SDE solver for stable motion)
• Base Steps — 8 (Main diffusion sampling passes)
• Refine Steps — 3 (Quality refinement after upscale)
• CFG Scale — 1.0 (Classifier-free guidance; 1.0 = no guidance, stable output)
• Output Resolution — 1920×1088 (After 2× spatial upscale)
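
The note that a CFG scale of 1.0 means "no guidance" follows from the usual classifier-free guidance formula, which mixes an unconditional and a conditional prediction. The toy sketch below uses made-up three-element lists rather than real model outputs, and `apply_cfg` is an illustrative helper, not a ComfyUI function.

```python
def apply_cfg(uncond, cond, scale):
    """Standard classifier-free guidance mix, element by element."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.5, -1.0, 2.0]  # toy "unconditional" prediction
cond = [1.0, 0.0, 1.5]     # toy "conditional" prediction

print(apply_cfg(uncond, cond, 1.0))  # [1.0, 0.0, 1.5], identical to cond
print(apply_cfg(uncond, cond, 3.0))  # [2.0, 2.0, 0.5], pushed past cond
```

At a scale of 1.0 the unconditional term cancels out, so the result is just the conditional prediction. That cancellation is also what lets an implementation skip the unconditional pass entirely at this setting.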

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

💡 **Performance Tips**

• Batch Multiple Images — Queue 5–10 images in one session to amortize model load time
• Input Image Quality — Sharp, well-lit images yield sharper motion; low-contrast images may produce soft motion
• Motion Prompt Tuning — Edit the QwenVL text output node before queuing if you want a specific motion direction (e.g., remove camera keywords to force a static camera)
• Speed vs. Quality — The dual-stage upscale adds ~20 seconds per clip. Bypass the Spatial Upscaler node if speed is critical (output at 960×544)
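
The prompt-tuning tip can be illustrated with a small text filter. This sketch drops whole sentences that mention a camera move, which is a blunter approach than editing individual keywords. The keyword list is hypothetical and should be adjusted to whatever QwenVL actually emits, and the function is a standalone illustration rather than part of the workflow.

```python
import re

# Hypothetical keyword list; adjust to whatever QwenVL actually emits.
CAMERA_TERMS = ["pan", "zoom", "dolly", "tilt", "tracking shot", "orbit"]

def strip_camera_sentences(prompt, terms=CAMERA_TERMS):
    """Drop sentences that mention a camera-move term, keep the rest."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\w*\b",
        re.IGNORECASE,
    )
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    kept = [s for s in sentences if not pattern.search(s)]
    return " ".join(kept)

prompt = (
    "Steam rises slowly from the cup. The camera pans left across the table. "
    "Leaves sway gently in the background."
)
print(strip_camera_sentences(prompt))
# Steam rises slowly from the cup. Leaves sway gently in the background.
```

Prefix matching is deliberately loose so that "pans" and "zooming" are caught, which also means unrelated words such as "panel" would trigger a removal. Review the filtered prompt before queuing.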

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📝 **Notes & AI Disclosure**

• AI-Generated Content — All example outputs are AI-generated by LTX 2.3. Suitable for stock footage, ambient loops, and creative projects.
• Model Downloads — See the "Download Links" section above for exact HuggingFace repos and target folders.
• Hardware Tested — RTX 5080, 16 GB VRAM (Blackwell, CUDA compute capability 12.0)
• VRAM Usage — ~14 GB peak during sampling; a fast SSD is required for frame buffering
• No Commercial Guarantees — Use at your own discretion. Respect local AI disclosure laws when publishing outputs.

Enjoy clean, drift-free motion generation. Questions? Test the workflow locally first; the Civitai comments section is for feedback, not troubleshooting.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚖️ **Model Attribution & Licensing**

**LTX-Video 2.3** (Lightricks)
• License: LTX-2 Community License — https://huggingface.co/Lightricks/LTX-2.3/blob/main/LICENSE
• Free for commercial use by entities under $10M USD annual revenue
• AI-generated content disclosure required

**Gemma 3 12B IT** (Google DeepMind)
• License: Gemma Terms of Use — https://ai.google.dev/gemma/terms
• Subject to Google's Prohibited Use Policy

**Custom Nodes**
• LTXV (Lightricks), VideoHelperSuite (MIT), AILab QwenVL, rgthree-comfy (MIT), ComfyUI-GGUF

This workflow (a JSON configuration) is shared as original work; model weights must be downloaded separately from the sources listed above.