## Technical Guide: Training ACE-Step 1.5 LoRA for Psytrance on an RTX 3090

## 1. Dataset Preprocessing & Surgical Slicing
Standard audio slicing methods destroy the rhythm and phase alignment of dense audio
material like Psytrance. Follow these strict preprocessing rules:
* Zero-Crossing Slicing: Cuts must occur exactly at zero-crossing points (where the waveform amplitude passes through zero) to avoid digital clicks, pops, and phase cancellation.
* Zero Fade/Crossfade Rule: Never apply fade-ins or fade-outs to the exported slices. The diffusion model will interpret them as a musical instruction and learn to fade the track out every 30 seconds.
* 1-Second Overlap: Instead of fading, slice adjacent chunks with a 1-second overlap of shared source material to maintain continuity across sample boundaries (see the slicing sketch after this list).
* Fixed Chunk Length: Slice the source material into exact 30.0-second segments. This
captures complete musical phrases while fitting comfortably into the 24 GB VRAM limit.
* Format Constraints: Export all slices at 44.1 kHz, 16-bit or 24-bit PCM WAV. Avoid MP3
compression to prevent codec artifacts from muddying the high frequencies.
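The rules above fit into a small slicing script. The following is a minimal sketch, assuming numpy and soundfile are available and the source has already been resampled to 44.1 kHz; `nearest_zero_crossing`, the snap window, and the output naming are illustrative helpers, not part of any ACE-Step tooling:

```python
# Minimal slicing sketch (hypothetical helper, not part of any ACE-Step
# tooling): 30.0 s chunks, 1 s overlap, cuts snapped to zero crossings.
import numpy as np
import soundfile as sf

SR = 44100                     # 44.1 kHz, per the format constraints above
CHUNK = int(30.0 * SR)         # fixed 30.0-second segments
HOP = int((30.0 - 1.0) * SR)   # 1-second overlap between adjacent chunks

def nearest_zero_crossing(x: np.ndarray, idx: int, window: int = 2048) -> int:
    """Snap a cut index to the closest sign change within +/- window samples."""
    lo, hi = max(idx - window, 0), min(idx + window, len(x))
    s = np.signbit(x[lo:hi])
    flips = np.where(s[:-1] != s[1:])[0]   # offsets of sign changes from lo
    if flips.size == 0:
        return idx                         # no crossing nearby; keep the raw cut
    return lo + int(flips[np.argmin(np.abs(flips + lo - idx))])

def slice_track(path: str, out_prefix: str) -> None:
    audio, sr = sf.read(path)
    assert sr == SR, "resample the source to 44.1 kHz first"
    mono = audio.mean(axis=1) if audio.ndim == 2 else audio
    start, i = 0, 0
    while start + CHUNK <= len(mono):
        a = nearest_zero_crossing(mono, start)
        a = min(a, len(mono) - CHUNK)      # keep the slice exactly 30.0 s
        sf.write(f"{out_prefix}_{i:03d}.wav", audio[a:a + CHUNK], SR,
                 subtype="PCM_24")         # 24-bit PCM WAV export
        start += HOP
        i += 1
```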
## 2. Text-Caption Tagging Strategy
Audio diffusion models require metadata to isolate tempo and key. Each audio slice requires
a matching .txt file with identical naming.
* BPM & Key Isolation: Explicitly tag the precise BPM and musical key (e.g., 142 bpm, G#
minor). This prevents the model from blending different tempos and scales into a dissonant
mix.
* Sub-Genre Descriptor: Start every caption with a unified anchor tag (e.g., psytrance track).
* Structural Elements: Document specific sonic elements present in that chunk (e.g., rolling
triplet bassline, punchy energetic kickdrum, sharp acid synth leads, rhythmic percussion,
crisp hi-hats).
* Quality Tokens: Append production-quality tags at the end of the caption (e.g., studio master quality, clean professional mix); a complete example caption follows this list.
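Put together, the caption file for a single 142 BPM chunk in G# minor might read as one line of comma-separated tags, mirroring the generation prompt in Section 5:

```
psytrance track, 142 bpm, G# minor, rolling triplet bassline, punchy energetic kickdrum, sharp acid synth leads, rhythmic percussion, crisp hi-hats, studio master quality, clean professional mix
```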
## 3. Training Hyperparameters & VRAM Optimization (RTX 3090)
To maximize the 24 GB VRAM of an RTX 3090 without triggering CUDA out of memory
errors, use these exact network dimensions and pipeline settings:
## Network Architecture (LoRA)
* LoRA Rank ($r$): 64 (Provides sufficient capacity to map distinct keys and tempos into
separate internal slots).
* LoRA Alpha: 32 (Ensures stable weight scaling).
* LoRA Dropout: 0.05 (Prevents overfitting while retaining rapid pattern recognition).
* Target Modules: ["to_q", "to_k", "to_v", "to_out.0", "ff.net.0.proj", "ff.net.2"]
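If your training script builds the adapter with Hugging Face peft (an assumption; ACE-Step trainers may expose these fields through their own config format instead), the dimensions above map to roughly:

```python
# Hedged sketch: the LoRA dimensions above expressed as a peft LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # rank: capacity for distinct key/tempo "slots"
    lora_alpha=32,      # effective scale alpha/r = 0.5 for stable updates
    lora_dropout=0.05,  # light regularization against overfitting
    target_modules=["to_q", "to_k", "to_v", "to_out.0",
                    "ff.net.0.proj", "ff.net.2"],
)
```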
## Optimization & Precision
* Mixed Precision: bf16 (Mandatory for modern GPU compute stability).
* Optimizer: bitsandbytes 8-bit AdamW (Stores optimizer states in 8 bits instead of 32, sharply cutting their VRAM footprint).
* Gradient Checkpointing: True (Recomputes activations during the backward pass to save
massive amounts of VRAM).
* DataLoader Settings: Set num_workers=4, pin_memory=True, and persistent_workers=True (see the wiring sketch after this list).
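As a sketch of how these settings fit together, assuming `model` and `dataset` are placeholders from your trainer setup and that the model exposes the diffusers-style `enable_gradient_checkpointing()` method (the exact hook name may differ in your trainer):

```python
# Hedged wiring sketch; `model` and `dataset` are placeholders.
import bitsandbytes as bnb
from torch.utils.data import DataLoader

optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=7e-5)  # 8-bit states
model.enable_gradient_checkpointing()  # diffusers-style name; may differ
loader = DataLoader(dataset, batch_size=2, num_workers=4,
                    pin_memory=True, persistent_workers=True)
# Run forward/backward under torch.autocast("cuda", dtype=torch.bfloat16)
# to get the bf16 mixed precision described above.
```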
## Training Schedule
* Batch Configuration: Set train_batch_size: 2 and gradient_accumulation_steps: 2 (Creates
an effective total batch size of 4, ensuring smooth gradient updates for complex audio
signals).
* Learning Rate: 0.00007 ($7\cdot10^{-5}$) with a cosine scheduler and 100 warmup steps.
A lower learning rate preserves sharp transient structures like tight kick drums.
* Seed: Set to -1 (Random Seed) across later epochs to shuffle data blocks and improve
generalization.
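A cosine schedule with warmup is available in Hugging Face transformers, assuming that is what your trainer uses under the hood; `total_steps` is a placeholder you would derive from dataset size and epoch count:

```python
# Hedged sketch: cosine LR schedule with 100 warmup steps.
from transformers import get_cosine_schedule_with_warmup

effective_batch = 2 * 2  # train_batch_size * gradient_accumulation_steps = 4
scheduler = get_cosine_schedule_with_warmup(
    optimizer,                       # the 8-bit AdamW from the sketch above
    num_warmup_steps=100,
    num_training_steps=total_steps,  # placeholder: steps for ~40 epochs
)
```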
## 4. Training Phases & Loss Graph Analysis
The training graph shows a textbook convergence curve for a dense audio dataset trained under a randomized seed:
```
Loss
0.55 |\
0.50 | \
0.45 |  \_________
0.40 |            \________
0.35 |                     \______  [Plateau / Saturated Fine-Tuning]
     +------------------------------------
       3100     3300     3500     3700   Step
```
* Phase 1 (Epoch 0 - 30): Macro-Structure Acquisition: The initial loss drops rapidly from
$\sim0.60$ down to $\sim0.45$. The model identifies coarse structural features, including
noise floors, fundamental frequencies, and the main percussive grid.
* Phase 2 (Epoch 30 - 35): Mid-Frequency Stabilization: The curve forms a gentle slope
between step 3100 and 3400. The random data seed (-1) introduces acoustic variety, forcing
the optimizer to consolidate structural patterns across different BPM/Key signatures
simultaneously.
* Phase 3 (Step 3400 - 3800): Micro-Optimization & Transients: The smoothed loss forms a textbook plateau between $0.36$ and $0.38$. The variance of the raw loss values narrows significantly, occasionally hitting micro-troughs near $0.31$. This indicates that the model has fully saturated its learning capacity for the dataset and is purely refining micro-details like phase alignment and crisp transient sharpness. Pushing the loss below $0.30$ is highly discouraged, as it triggers immediate acoustic degradation (overfitting).
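One way to operationalize that stop rule is to track an exponentially smoothed loss and halt once it crosses the floor. A minimal sketch follows; the 0.98 smoothing factor is an assumption, while the 0.30 floor comes from the analysis above:

```python
# Hedged sketch: EMA-smoothed loss monitor for the plateau / stop rule.
def make_loss_monitor(beta: float = 0.98, floor: float = 0.30):
    ema = None
    def update(raw_loss: float) -> bool:
        nonlocal ema
        ema = raw_loss if ema is None else beta * ema + (1 - beta) * raw_loss
        return ema < floor   # True => halt before acoustic degradation
    return update

should_stop = make_loss_monitor()  # call should_stop(loss) each step
```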
## 5. Inference & Audio Generation Configuration
Once training concludes at Epoch 40, halt the script and configure the Inference tab using
these precise generation parameters:
* Inference Backend: Set to PyTorch (Do not use vLLM or Triton on native Windows
environments due to library compatibility issues).
* Base Model Path: Point to checkpoints/acestep-v15-xl-sft.
* LoRA Model Path: Load the target checkpoint (e.g., epoch_35 or epoch_40).
* LoRA Scale: 0.85 to 1.0 (Start at 0.85 to maintain flexibility; increase to 1.0 if the synthetic
output lacks the driving weight of the original data).
* Inference Steps: 50 (Provides clean diffusion generation without blurring the fast
transients).
* CFG Scale: 4.5 to 5.5 (Higher values force strict adherence to the prompt tags, lower
values add acoustic variation).
* Audio Length: Exact 30.0 seconds (Must match the training slice length; generating beyond
this window causes structural collapse).
* Target Generation Prompt: Feed the explicit tokens used during tagging to extract the
clean, isolated style:
```
A high-energy psychedelic trance track, 142 BPM, fast driving rolling bassline, punchy energetic kickdrum, sharp acid synth leads, rhythmic percussion, crisp hi-hats, studio master quality, clean professional mix
```
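As a compact summary, the Inference-tab fields above can be captured in one settings block; the key names here are illustrative, not official ACE-Step API labels, and the LoRA checkpoint path is a placeholder:

```python
# Illustrative summary of the Inference-tab fields; key names are not
# official ACE-Step API and the LoRA path is a placeholder.
inference_settings = {
    "backend": "pytorch",                  # not vLLM/Triton on native Windows
    "base_model_path": "checkpoints/acestep-v15-xl-sft",
    "lora_model_path": "outputs/epoch_40", # or epoch_35; path is illustrative
    "lora_scale": 0.85,                    # raise toward 1.0 if output lacks weight
    "inference_steps": 50,
    "cfg_scale": 5.0,                      # within the 4.5-5.5 band
    "audio_length_s": 30.0,                # must match the training slice length
    "prompt": ("A high-energy psychedelic trance track, 142 BPM, fast driving "
               "rolling bassline, punchy energetic kickdrum, sharp acid synth "
               "leads, rhythmic percussion, crisp hi-hats, studio master "
               "quality, clean professional mix"),
}
```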
