
LoRA internals - Let's look at the beast's entrails


Apr 9, 2026

(Updated: a month ago)

musing

This article was initially written with ZIT in mind, but there are some tricky things there (like fused qkv layers), so it was simpler to do with Anima.

What's great with newer DiT models is that they are in fact rather simple models: often just a few identical blocks, a final layer and some weights for the time embedding. And since they don't yet have all the many tools available for SDXL and its derivatives, this allowed me to test and develop some tools on my own, validating my own understanding of diffusion models!

So, here is another article about LoRA and how they work! This will be a continuation of my article about SDXL, but I'll go FAR more in-depth (I have added some pretty pictures generated from the text to help :D)

1) Neural network layers

In my previous article, I explained that all current AI models are in fact neural networks. They take numbers as input and spit numbers out as output, with a lot of knobs to configure them.

Let's get a bit more technical. When using a framework like PyTorch, a neural network (NN) is defined by creating "modules" made of "layers".

Basically, you build your NN like Lego: you stack base bricks in a specific order with a goal in mind, chain all of those, and ta-da, you get your full model.

Some very classic layers are:

  • activation layers

  • convolution layers

  • linear layers

  • normalization layers

  • modulation layers
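
To make the "Lego bricks" idea concrete, here is a minimal sketch of such a module in PyTorch (the sizes and names are made up, not taken from any real model):

import torch
from torch import nn

# A made-up "block" stacking a few classic layers
class TinyBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)    # normalization layer
        self.proj = nn.Linear(dim, dim)  # linear layer
        self.act = nn.GELU()             # activation layer

    def forward(self, x):
        return self.act(self.proj(self.norm(x)))

# Chain a few identical blocks, like a DiT does
model = nn.Sequential(*[TinyBlock() for _ in range(4)])
print(model(torch.randn(1, 64)).shape)  # torch.Size([1, 64])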

A specific kind of NN, the one that started all of this buzz around AI in 2017, is called the "transformer" and is now also used as a base in most larger NNs; that's all the "cross attention" and "self attention" thingies you'll find. And in fact there is a bunch of linear layers in there. This is important, you'll see later why.

Most important to remember: one "transformer" layer is often 4 linear layers (noted Q, K, V and O for Query, Key, Value and Output projection). A DiT model leverages a few transformer layers in each block.

1.png

What is the role of those layers, and why do I care about them for LoRA?

Because the weights (the knob values) of those layers are what is stored in the model file. When you download a pth, safetensors or gguf file, what you download is just a bunch of values associated with "keys" that help AI tools put the right numbers back in the right place.

An example of a key (this is not a real key, but it looks like what you'll find in models): model.block.2.attention.qkv.weight

The weights associated with this key are what take up space, especially for linear layers: those are the most common layers and the ones that benefit the most from things like quantization.
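
If you want to see those keys yourself, a few lines of Python are enough (the file name here is just a placeholder):

from safetensors import safe_open

# Open a checkpoint lazily and list its keys, shapes and dtypes
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        t = f.get_tensor(key)
        print(key, tuple(t.shape), t.dtype)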

2.png

2) Focus on the linear layer

Linear layers are basically a table that maps X values (the input dimension) to Y values (the output dimension).

The "dimensions" are also named "features": For example, if you have a list of input features of an animal (number of legs, fur/feather or scales, active during day or night, laying eggs or not, etc...) you could have as output features a list of species and the likelyhood of belonging to it : 90% chance of being a mamal, 2% chance of being a bird, etc...

The way they work is by using a matrix multiplication (the famous operation that makes GPUs necessary).

Example: if you have 1000 values as input and 256 values as output, the layer is basically a 1000x256 table; that's 256000 numbers to store.

In FP16 or BF16, that's 512000 bytes, so already ~512 KB, and there are a LOT more (and larger) linear layers than this one in models. (Quick reminder: a byte is 8 bits, so a 16 bit number uses 2 bytes and a 32 bit number uses 4 bytes.)

3.png

Now, let's write this as some light math: with the input I, the output O and the layer L, we have O = I @ L (the @ is the matrix multiplication). I'll skip the "bias" part, which is basically an addition on top, so a set of weights of the same length as O. Those are most often left as-is (even if some LoRAs do also include a bias diff).
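
Here is that formula as a tiny sketch with random numbers (note that PyTorch's nn.Linear actually stores its weight transposed, as output x input, but the idea is the same):

import torch

I = torch.randn(1, 1000)    # 1 sample with 1000 input features
L = torch.randn(1000, 256)  # the layer: a 1000x256 table of weights
O = I @ L                   # the matrix multiplication
print(O.shape)              # torch.Size([1, 256])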

If I finetune a model, changing the values of L from L1 to L2, and I want to store and distribute only the change, I could provide L2 - L1... but this is still a 1000x256 matrix, of the same size... that's why LoRA was made.

3) LoRA layers

As I explained before, LoRA is a way to train and store fewer values when distributing a change to the model.

The way it is done internally is this: for each key in the model worth finetuning (so, mostly linear layers), two keys are stored in the LoRA: up and down (sometimes named A and B).

Why? Because by multiplying up (U) and down (D) together, you can rebuild an approximation of L2 - L1, based on a new value: the rank.

If the layer L is 1000x256 and the rank is 32, U is 1000x32 and D is 32x256, because U @ D is back to 1000x256. But instead of storing 256000 values, you only store 32000 + 8192 = 40192 values. That's about 84% fewer values to store!
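
Counting it in code (random matrices, just to check the shapes and sizes):

import torch

rank = 32
U = torch.randn(1000, rank)   # the "up" key
D = torch.randn(rank, 256)    # the "down" key
print((U @ D).shape)          # torch.Size([1000, 256]), same as the full layer
print(U.numel() + D.numel())  # 40192 values instead of 256000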

4.png

But aside from training from scratch, how can I make a LoRA out of an already finetuned model?

4) SVD to the rescue

As I explained in my old (and outdated) article about LoRA extraction, a mathematical operation, the Singular Value Decomposition, can help do this.

This is an operation that takes a matrix L and computes other matrices/vectors such that: L = U @ S @ Vh

This may seem overkill by itself, but what's interesting is the way those are constituted: if you take only a part of S (like... only as many values as the desired rank of the LoRA 😁) and do some maths, you can generate the Up and Down keys!
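
Here is the idea in a few lines: keep only the first rank singular values and you get a low-rank approximation of the original matrix (random data, purely illustrative):

import torch

L = torch.randn(1000, 256)
rank = 32

U, S, Vh = torch.linalg.svd(L, full_matrices=False)
up = U[:, :rank] @ torch.diag(S[:rank])  # 1000x32
down = Vh[:rank, :]                      # 32x256

print((L - up @ down).abs().mean())      # the approximation error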

5.png

5) A quick note regarding alpha

Some (most, really) LoRAs leverage an additional parameter called alpha (except ZIT LoRAs AFAIK). This is used to increase the value of the numbers stored, to avoid losing too much precision.

The way it works is:

  • During training, the weights are multiplied by a Scale = Rank / Alpha. That's why alpha is often a ratio of rank, for example if Rank = 32 and Alpha = 16, then the Scale is 2.

  • During inference, once the weights are rebuilt, they are divided by the Scale. In the LoRA, the Rank can be figured out from the shape of the Up or Down keys, but Alpha must be stored as an additional value (see the small sketch below).
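
A tiny numeric sketch of that round trip (the matrix is random, only the scaling matters here):

import torch

rank, alpha = 32, 16.0
scale = rank / alpha                   # 2

delta = torch.randn(1000, 256)         # the "real" weight change
stored = delta * scale                 # what ends up in the LoRA (bigger numbers)
rebuilt = stored / scale               # what inference applies
print(torch.allclose(delta, rebuilt))  # True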

6) Let the fun begin!

Why explain all this (yeah really, why)?

Because now, you have all the needed knowledge for some standard operations with LoRA:

  • Resize a LoRA: you rebuild L = U @ D and then redo the SVD for another rank (see the small sketch right after this list)

  • Merge LoRAs regardless of rank: you compute Merge = U1 @ D1 + U2 @ D2 (optionally multiplying each term by a strength factor like 0.8) and with an SVD of Merge, you get Umerge and Dmerge. (PS: if the LoRAs are of the same rank, you can just multiply the keys by a value and add them. The value must be the square root of the expected strength since both U and D are multiplied by it.)

  • Extract a LoRA from the difference between two models: that's the SVD of Model2 - Model1 for each key. And especially with a "preview" model such as Anima changing every two months, this could be useful! :D
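
As a warm-up, here is a minimal sketch of that first operation, resizing a single up/down pair to a new rank (the shapes and ranks are made up for the example):

import torch

def resize_lora_pair(up, down, new_rank):
    # Rebuild the full (approximate) weight diff, then re-factor it with SVD
    full = up.float() @ down.float()
    U, S, Vh = torch.linalg.svd(full, full_matrices=False)
    return U[:, :new_rank] @ torch.diag(S[:new_rank]), Vh[:new_rank, :]

# Example: shrink a rank 64 pair down to rank 16
up, down = torch.randn(1000, 64), torch.randn(64, 256)
new_up, new_down = resize_lora_pair(up, down, 16)
print(new_up.shape, new_down.shape)  # torch.Size([1000, 16]) torch.Size([16, 256])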

So, here is some Python to demonstrate the extraction. BUT since this is all academic, I will not release a tool: it could be leveraged to extract LoRAs from merged checkpoints, which would be unfair to model creators. I will still demonstrate how it works as a way to "port" stuff from a previous preview version to a newer one.

Here is what each Anima preview version does with the same prompt:

xyz_grid-0002-123666321.jpg

What if I could make a LoRA that takes the first version and brings it up to speed with the third one?

Could I use this LoRA to "upgrade" my previous model/LoRA? (Here is the spoiler: no, sadly, they are too far apart.)

But anyway, here is some code (that I made EXTRA simple to avoid making it too hard to read):

import sys, re

# Very basic args check
if len(sys.argv) < 4:
    print(f"Usage: {sys.argv[0]} origin.safetensors tuned.safetensors lora.safetensors")
    exit(1)

# Importing the libs
from safetensors.torch import load_file, save_file
import torch
from tqdm import tqdm

# No gradient needed
torch.set_grad_enabled(False)

# List of target layer per diffusion block for LoRA
anima_lora_layers = [
    "cross_attn.k_proj.weight",
    "cross_attn.output_proj.weight",
    "cross_attn.q_proj.weight",
    "cross_attn.v_proj.weight",
    "mlp.layer1.weight",
    "mlp.layer2.weight",
    "self_attn.k_proj.weight",
    "self_attn.output_proj.weight",
    "self_attn.q_proj.weight",
    "self_attn.v_proj.weight",
]

# List of llm_adapter interesting layers
# Almost same as above, only named differently
anima_llm_layers = [
    "cross_attn.k_proj.weight",
    "cross_attn.o_proj.weight",
    "cross_attn.q_proj.weight",
    "cross_attn.v_proj.weight",
    "mlp.0.weight",
    "mlp.2.weight",
    "self_attn.k_proj.weight",
    "self_attn.o_proj.weight",
    "self_attn.q_proj.weight",
    "self_attn.v_proj.weight",
]

# Detect and remove key prefix
def remove_prefix(st):
    # Remove VAE and TE if present
    for k in list(st.keys()):
        if k.startswith("cond_stage_model.") or k.startswith("first_stage_model."):
            _ = st.pop(k)

    # Read first diffusion model key
    sample_key = list(st.keys())[0]

    # Find prefix used among three known prefix
    prefix_match = re.match(
        r"^(model.diffusion_model.|net.|diffusion_model.)", sample_key
    )

    # if found, remove it
    if prefix_match:
        prefix = prefix_match.groups()[0]
        for k in list(st.keys()):
            st[k.replace(prefix, "")] = st.pop(k)


# The most BASIC svd extraction of lora weight
def extract_lora_weight(m, r):
    u, s, vh = torch.linalg.svd(m.cuda())
    B = u[:, :r] @ torch.diag(s[:r])
    A = vh[:r, :]
    return (B.cpu().contiguous(), A.cpu().contiguous())

# Prepare the list of weights based on target layers
weight_list = []
for block_index in range(28):
    for layer in anima_lora_layers:
        weight_list.append(f"blocks.{block_index}.{layer}")
for block_index in range(6):
    for layer in anima_llm_layers:
        weight_list.append(f"llm_adapter.blocks.{block_index}.{layer}")

# Load models
print(f"Loading {sys.argv[1]}...")
orig = load_file(sys.argv[1])
remove_prefix(orig)
print(f"Loading {sys.argv[2]}...")
tune = load_file(sys.argv[2])
remove_prefix(tune)

# Get ready
lora = {}
rank = 64
alpha = float(32)
scale = rank / alpha

# Extract LoRA
print("Extracting...")
for k in tqdm(weight_list):
    diff = (tune[k].to(dtype=torch.float) - orig[k].to(dtype=torch.float)) * scale
    if torch.mean(torch.abs(diff)) < 0.0001:
        # Not enough difference, let's skip it
        del diff
        continue
    B, A = extract_lora_weight(diff, rank)
    del diff
    base_key = k.replace(".weight", "")
    lora[f"diffusion_model.{base_key}.lora_B.weight"] = B.to(dtype=torch.bfloat16)
    lora[f"diffusion_model.{base_key}.lora_A.weight"] = A.to(dtype=torch.bfloat16)
    lora[f"diffusion_model.{base_key}.alpha"] = torch.tensor(
        alpha, dtype=torch.bfloat16
    )

# Save LoRA
print(f"Saving {sys.argv[3]}...")
save_file(lora, sys.argv[3])
print("Done!")
exit(0)

Basically, here, I am targeting:

  • The usual linear layers that are targeted by Anima LoRAs (on all 28 blocks)

  • Some extra layers from the llm_adapter (those are not usually trained, but here I need them since I am looking at a full finetune). These are 6 extra blocks.

  • I am using Rank 64 and Alpha 32.

  • You'll notice both transformer layers (q, k, v and o) and MLP layers (multilayer perceptron, a very classic NN)

Here it is executed:

# python extract_lora.py anima-preview.safetensors anima-preview3-base.safetensors p1-to-p3.safetensors
Loading anima-preview.safetensors...
Loading anima-preview3-base.safetensors...
Extracting...
100%|███████████████████████| 340/340 [04:08<00:00,  1.37it/s]
Saving p1-to-p3.safetensors...
Done!

The core math is the SVD:

def extract_lora_weight(m, r):
    u, s, vh = torch.linalg.svd(m.cuda())
    B = u[:, :r] @ torch.diag(s[:r])
    A = vh[:r, :]
    return (B.cpu().contiguous(), A.cpu().contiguous())

I am using it to compute the Up and Down from (Preview 3) - (Preview 1).
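
If you want to convince yourself that B and A really rebuild the diff, re-using the function above (and a CUDA GPU, like the original snippet), a quick check on a stand-in matrix works:

import torch

# A stand-in diff that is genuinely rank 64, so the rebuild is near exact;
# real model diffs are only approximately low rank
diff = torch.randn(1152, 64) @ torch.randn(64, 1152)
B, A = extract_lora_weight(diff, 64)
print((diff - B @ A).abs().mean())  # close to zero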

Now, did it work? Forge accepts my LoRA:

[LORA] Loaded p1-to-p3.safetensors for KModel-UNet with 280 keys at weight 1.0 (skipped 0 keys) with on_the_fly = False
[LORA] Loaded p1-to-p3.safetensors for KModel-CLIP with 60 keys at weight 1.0 (skipped 0 keys) with on_the_fly = False

But the result is messy:

00019-123666321.png

This is most probably due to some other weights being changed that are not included in the LoRA. But what about merged checkpoints? Here I'll leverage the very nice Anima model from CitronLegacy.

Here is what it does:

00020-123666321.png

If I run the extraction against Preview 1 (the source of this particular model), here is what the LoRA does when applied to the base preview model:

[LORA] Loaded CAT_extract.safetensors for KModel-UNet with 275 keys at weight 1.0 (skipped 0 keys) with on_the_fly = False

00021-123666321.png

That's VERY close! What about using it on Preview 3:

00022-123666321.png

Nice! Most of the style is still there and we are running on the newest version 💜

AAAND that's all 🥰 Thanks everyone for reading, I hope it either helped some people learn more about LoRAs and/or helped model creators 🤗

Extra tech for the curious: at the very beginning, I talked about fused qkv layers.

Some models don't store the q, k and v linear layers of a transformer layer independently but as a single qkv layer. Those are basically just a concatenation, but in the case of ZIT, they are separated in the LoRA while in the Comfy compatible model they are fused. This means that to do the math, you must first split those, and even as simple as it is, it could have confused people.

It's basically "let's cut it into three parts":

# Split a fused qkv weight into separate q, k and v keys
def remap_qkv(key, state_dict):
    weight = state_dict.pop(key)
    to_q, to_k, to_v = weight.chunk(3, dim=0)
    state_dict[key.replace(".qkv.", ".to_q.")] = to_q
    state_dict[key.replace(".qkv.", ".to_k.")] = to_k
    state_dict[key.replace(".qkv.", ".to_v.")] = to_v
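
To apply it over a whole state dict, a small loop is enough (a minimal sketch re-using remap_qkv from above; the fused keys are assumed to contain ".qkv." like in the example):

import torch

# Tiny fake state dict with one fused qkv key, just to show the call
state_dict = {"blocks.0.attn.qkv.weight": torch.randn(3 * 128, 128)}

for key in list(state_dict.keys()):
    if ".qkv." in key:
        remap_qkv(key, state_dict)

print(list(state_dict.keys()))
# ['blocks.0.attn.to_q.weight', 'blocks.0.attn.to_k.weight', 'blocks.0.attn.to_v.weight']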

And voilà, you get all the keys you need 😝
