Qwen Image VAE
Full FP32 Training of Decoder
Works in ComfyUI
Feel free to suggest on-site support to the Civitai staff. I don't think they have any agreements in place like the ones they have with FLUX.
Overview
This model is a fine-tuned variant of the base Qwen Image VAE, modified to emphasize high-frequency detail preservation and expanded color representation, following an HDR-style reconstruction objective.
The evaluation compares the base and HDR-tuned models using perceptual, structural, distributional, and photometric metrics over identical input data.
Evaluation Summary
Perceptual Fidelity (LPIPS)
Base: 0.0177
HDR: 0.0786
The HDR model's LPIPS distance is roughly 4.4 times that of the base model, so its reconstructions match the inputs less closely under deep-feature similarity. This is consistent with a shift away from strict identity reconstruction and toward detail-enhancing reconstruction.
Structural Energy (Gradient Magnitude)
Ground Truth: 404.02 (both models)
Base Reconstruction: 313.46
HDR Reconstruction: 687.97
The base model retains about 78% of the ground-truth gradient energy, which indicates low-pass behavior with reduced high-frequency content. In contrast, the HDR model reaches about 170% of the ground-truth value: it amplifies high-frequency content beyond what is present in the original inputs.
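The exact definition of the gradient-magnitude metric is not stated here. As a rough sketch, one plausible reading is the sum of finite-difference gradient magnitudes over a grayscale image. The function name `gradient_energy` and the synthetic blur/sharpen stand-ins below are assumptions made for illustration, not the evaluation code behind the numbers above.

```python
import numpy as np

def gradient_energy(image: np.ndarray) -> float:
    """Sum of forward-difference gradient magnitudes of a 2D grayscale image.

    Assumed definition: the model card does not specify the actual formula.
    """
    img = image.astype(np.float64)
    # Forward differences, cropped to a common (H-1, W-1) shape.
    dx = img[:-1, 1:] - img[:-1, :-1]
    dy = img[1:, :-1] - img[:-1, :-1]
    return float(np.sqrt(dx ** 2 + dy ** 2).sum())

# Synthetic stand-ins for ground truth, a low-pass reconstruction,
# and a detail-amplifying reconstruction.
rng = np.random.default_rng(0)
ground_truth = rng.random((64, 64))

# 3x3 box blur with edge padding (stand-in for a smoothing decoder).
padded = np.pad(ground_truth, 1, mode="edge")
blurred = sum(
    padded[i:i + 64, j:j + 64] for i in range(3) for j in range(3)
) / 9.0

# Unsharp masking (stand-in for a high-frequency amplifying decoder).
amplified = ground_truth + 0.5 * (ground_truth - blurred)

e_gt = gradient_energy(ground_truth)
e_blur = gradient_energy(blurred)
e_amp = gradient_energy(amplified)
print(f"blurred / gt:   {e_blur / e_gt:.3f}")
print(f"amplified / gt: {e_amp / e_gt:.3f}")
```

Under this reading, a ratio below 1 corresponds to smoothing and a ratio above 1 to high-frequency amplification, which is the pattern reported for the base and HDR reconstructions respectively.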
Color Distribution Support
Ground Truth: 33150.61 (both models)
Base Reconstruction: 35004.49
HDR Reconstruction: 40133.37
Both reconstructions exceed the ground-truth color support: the base model by about 6% and the HDR model by about 21%. The substantially larger support of the HDR model indicates increased chromatic dispersion and reduced quantization collapse.
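How "color distribution support" is computed is not stated here. A simple proxy, assumed purely for illustration, is the number of distinct RGB triplets in an 8-bit image. The function name `color_support` and the synthetic images are hypothetical.

```python
import numpy as np

def color_support(image_u8: np.ndarray) -> int:
    """Count distinct RGB triplets in an (H, W, 3) uint8 image.

    Assumed proxy: the model card does not define its support metric.
    """
    pixels = image_u8.reshape(-1, 3).astype(np.uint32)
    # Pack each RGB triplet into one integer so np.unique works on a 1D array.
    packed = (pixels[:, 0] << 16) | (pixels[:, 1] << 8) | pixels[:, 2]
    return int(np.unique(packed).size)

rng = np.random.default_rng(0)
full_range = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
# Coarse quantization to 8 levels per channel collapses many colors together.
quantized = (full_range // 32) * 32

support_full = color_support(full_range)
support_quantized = color_support(quantized)
print(support_full, support_quantized)
```

With this proxy, a reconstruction that collapses nearby colors onto the same values has a smaller support than one that spreads them out.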
Photometric Stability
Brightness Bias
Base: 0.000351
HDR: 0.0000098
Contrast Gain
Base: 0.9984
HDR: 0.99999
Both models preserve global photometric consistency, with the HDR variant showing near-perfect affine stability.
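One way to read brightness bias and contrast gain is as the intercept and slope of a least-squares affine fit, reconstruction ≈ gain × ground truth + bias. The sketch below assumes that interpretation. The function name `affine_photometric_fit` and the synthetic data are illustrative, not the evaluation code used for the figures above.

```python
import numpy as np

def affine_photometric_fit(gt: np.ndarray, recon: np.ndarray) -> tuple[float, float]:
    """Fit recon ~ gain * gt + bias over all pixels by least squares.

    Assumed interpretation of 'contrast gain' and 'brightness bias'.
    """
    x = gt.astype(np.float64).ravel()
    y = recon.astype(np.float64).ravel()
    gain, bias = np.polyfit(x, y, deg=1)  # returns [slope, intercept]
    return float(gain), float(bias)

# Synthetic check: apply a known gain and bias, then recover them.
rng = np.random.default_rng(0)
gt = rng.random((32, 32, 3))
recon = 0.95 * gt + 0.02

gain, bias = affine_photometric_fit(gt, recon)
print(f"gain={gain:.4f}, bias={bias:.4f}")
```

A gain close to 1 and a bias close to 0 mean the reconstruction preserves global brightness and contrast, whatever happens to local detail.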
Channel Drift
Red Shift:
Base: +0.0116
HDR: +0.0104
Green Shift:
Base: -0.0606
HDR: -0.1856
Blue Shift:
Base: +0.0187
HDR: +0.0219
The HDR model's negative green-channel bias is roughly three times that of the base model (-0.1856 versus -0.0606), while its red and blue shifts remain comparable to the base model's.
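The units and exact definition of the channel shifts are not given here. A minimal sketch, assuming drift means the per-channel mean of (reconstruction minus ground truth), could look like the following. The function name `channel_drift` and the test offsets are hypothetical.

```python
import numpy as np

def channel_drift(gt: np.ndarray, recon: np.ndarray) -> np.ndarray:
    """Mean signed difference per channel for (H, W, 3) images.

    Assumed definition: positive values mean the reconstruction is
    brighter than the ground truth in that channel.
    """
    diff = recon.astype(np.float64) - gt.astype(np.float64)
    return diff.reshape(-1, 3).mean(axis=0)

# Synthetic check: add a known per-channel offset and recover it.
rng = np.random.default_rng(0)
gt = rng.random((16, 16, 3))
offsets = np.array([0.01, -0.05, 0.02])
recon = gt + offsets  # broadcasts over the channel axis

drift = channel_drift(gt, recon)
print(dict(zip(("red", "green", "blue"), drift.round(4))))
```

Under this definition, a negative green value means the reconstruction carries less green on average than the input.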
Interpretation
The base Qwen VAE behaves as a contractive perceptual projection operator, prioritizing smooth reconstructions and suppression of high-frequency components.
The HDR-tuned variant transitions into a detail-amplifying reconstruction operator, characterized by:
Increased high-frequency energy
Expanded color manifold coverage
Higher perceptual divergence under LPIPS
Preserved global photometric invariance
This represents a functional shift from a smoothing autoencoder regime toward a high-frequency preserving (HDR-like) reconstruction regime.


