
How to fine tune it? #8

Open
n0r8 opened this issue Dec 9, 2022 · 43 comments
Labels
question Further information is requested

Comments

@n0r8

n0r8 commented Dec 9, 2022

I am a noob. Can you describe how I can fine-tune it with your program? Is it possible? Maybe point me to some articles.

@ggerganov ggerganov added the question Further information is requested label Dec 10, 2022
@ggerganov
Owner

Fine-tuning is not possible at the moment. You can fine-tune a model with some other implementation and then convert it and use it with ggml. For example, here are instructions on how to fine-tune Whisper:

https://github.com/ggerganov/whisper.cpp/tree/master/models#fine-tuned-models

@loretoparisi

Fine-tuning is not possible at the moment. You can fine-tune a model with some other implementation and then convert it and use it with ggml. For example, here are instructions on how to fine-tune Whisper:

https://github.com/ggerganov/whisper.cpp/tree/master/models#fine-tuned-models

Hey Georgi, just curious: why is fine-tuning not possible, technically speaking? Let's ignore CUDA for now and assume it would work CPU-only; what is missing? Thank you!

@ggerganov
Owner

We need the backward pass for all tensor operations involved.
Currently, we have it implemented only for some of them.
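
A minimal sketch (NumPy, not the ggml API) of what a per-operation backward pass means: every forward operation needs a matching rule that maps the gradient of its output back to gradients of its inputs, and the full backward pass just chains these rules in reverse.

import numpy as np

def mul_backward(grad_out, a, b):
    # forward: c = a * b  ->  dL/da = dL/dc * b, dL/db = dL/dc * a
    return grad_out * b, grad_out * a

def softmax_backward(grad_out, y):
    # forward: y = softmax(x) (1-D)
    # backward: dL/dx = y * (dL/dy - sum(dL/dy * y))
    return y * (grad_out - np.dot(grad_out, y))

The scalar sum(dL/dy * y) in the softmax rule is the kind of per-row scalar term that the GGML_OP_ADD1 discussion later in this thread refers to.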

@n0r8
Author

n0r8 commented Mar 30, 2023

Thanks. After I asked the question here I did some research and understood that your software provides an interface, not the tooling for fine-tuning. I was a noob back when I asked this.

@xaedes

xaedes commented Apr 22, 2023

Training directly with ggml would be really nice.
Implemented 8 out of 14 missing tensor ops.

xaedes/llama.cpp@757de70

I had to add another ggml operation, GGML_OP_ADD_AT, as the counterpart of GGML_OP_VIEW in the backward pass. This duplicated the code for the add functions. Maybe the offset parameter can just be moved into the regular add functions, which could then be used for ADD_AT. I was not sure about the performance of doing it that way, so I just duplicated the functions for now.

I will continue with the rest, test it with this repo's test_grad, and make a pull request when I think it is ready.

@xaedes

xaedes commented Apr 24, 2023

Only GGML_ROPE is still missing for llama, plus GGML_OP_GET_ROWS, but the latter is only required for training the tokenizer embeddings. So far the gradients are untested; that will come next, right after the rope backward pass is implemented.

xaedes/llama.cpp@28de592

Unfortunately a bunch of new operations had to be added:

Add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX.
GGML_OP_ADD1 could also be replaced by GGML_OP_ADD together with GGML_OP_REPEAT, but the performance would be worse. Additionally, GGML_OP_REPEAT returns an unexpected value when the input to GGML_OP_SOFT_MAX contains only a single scalar: in that case GGML_OP_REPEAT does not return the value that should be repeated (src1) but the value whose shape the result should take (src0), so it cannot replace GGML_OP_ADD1 there.

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for the backward passes of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass of these functions cannot easily be composed from existing operations. Since the backward pass itself builds a computation graph, we need forward-pass implementations of these backward operations. Sounds a bit confusing at first, I know...

GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.

Some operations were previously inplace-only; for the backward pass there need to be non-inplace variants. Staying consistent with other operations that have both non-inplace and inplace variants, the operations are changed to non-inplace, and functions with an "_inplace" suffix are added for the inplace versions. In llama we call the inplace variants so that it behaves as before; for the llama backward pass we use the non-inplace variants.

@ggerganov
Owner

@xaedes
Wow! If you manage to pull this off and make training work it would be amazing!

@xaedes

xaedes commented Apr 28, 2023

I successfully tested every backward pass except rms_norm.
Hm, and maybe I have to take another closer look at view_2d and view_3d; view_1d works.
Also tested optimizing parameters A & B in Sum((A*B-C)**2), which I think is necessary for LoRA training.
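
A hedged sketch (plain PyTorch, not ggml) of that kind of problem: optimizing two factors A and B so that their product matches C by minimizing Sum((A*B-C)**2), which is the same shape of problem as fitting low-rank LoRA factors. Sizes and optimizer settings are illustrative.

import torch

torch.manual_seed(0)
n, rank = 8, 2
C = torch.randn(n, n)
A = torch.randn(n, rank, requires_grad=True)
B = torch.randn(rank, n, requires_grad=True)

opt = torch.optim.Adam([A, B], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = ((A @ B - C) ** 2).sum()   # Sum((A*B - C)**2)
    loss.backward()
    opt.step()
print(loss.item())                    # converges to the best rank-2 fit of C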

Once rms_norm also works I will push it to a proper branch of my llama fork, as I currently have Python bindings in my working branch.
Unfortunately I can't easily work directly in the ggml repo, as Windows support is not as good as in llama and I am on a Windows machine here.

When necessary I can also rewrite this stuff to integrate with the mentioned refactoring, but I will make it work here first^^

@danforbes
Contributor

@xaedes have you looked into Windows Subsystem for Linux?

@xaedes

xaedes commented May 1, 2023

@danforbes Yep, but for easier development with the setup I am used to, I wanted it to work in that environment and didn't want to get lost fiddling with platform stuff^^

@xaedes

xaedes commented May 1, 2023

@ggerganov I have now successfully tested all backward passes necessary for llama. https://github.com/xaedes/llama.cpp/tree/training-integrate

List of all new operations that I had to add:

  • GGML_OP_ADD1 : I think I could replace it with add(X,repeat(Y,X))
  • GGML_OP_ADD_AT : Necessary for view backward pass. This adds src1 to view(src0) but returns tensor of shape src0. Maybe this operation could get another name like ACC_VIEW?
  • GGML_OP_SUM_ROWS : Necessary for repeat backward pass: Reduces rows by summing them. shape[a,b,c,d] -> shape[1,b,c,d]
  • GGML_OP_SILU_BACK : Necessary for silu backward pass
  • GGML_OP_RMS_NORM_BACK : Could also be implemented using primitives, at the cost of performance.
  • GGML_OP_GET_ROWS_BACK : Necessary for get_rows backward pass: Adds src0[i] rows to opt0[src1[i]] rows, returning a tensor of shape opt0. Maybe this operation could get a more meaningful name, something like ADD_ROWS_TO, or ACC_ROWS_TO?
  • GGML_OP_DIAG : Necessary for softmax backward pass, alternative would have been to implement SOFTMAX_BACK directly, but DIAG is at least usable for other stuff. It turns rows into diagonal matrices.
  • GGML_OP_DIAG_MASK_ZERO : Necessary for diag_mask_inf backward pass
  • GGML_OP_ROPE_BACK : Necessary for rope backward pass.

Notable other changes:

  • add inplace and non-inplace variants for scale, diag_mask_inf, soft_max and rope
  • fix the sub, mul and div functions to work correctly with transposed tensors, using the same logic as in add
  • fix the ggml_compute_forward_add functions to work correctly with transposed tensors. This uses the same logic as in ggml_compute_forward_add_q_f32, made consistent across all ggml_compute_forward_add_... functions. It also slightly changes the memory access pattern of the different threads to work as in ggml_compute_forward_add_q_f32. Maybe the mem pattern in the for loop for (int j = ith; j < n; j += nth) was important to keep? Each thread now has a consecutive range of rows to process: for (int ir = ir0; ir < ir1; ++ir)
  • de-duplicate the ggml_compute_forward_dup code handling contiguous tensors of the same type. With this we can duplicate tensors of any type as long as they are contiguous. The function is used in dup, get_rows_back and diag_mask (when not inplace).
  • there are some maybe-too-verbose comments, including step-by-step derivations of the gradients, that could be cleaned up.

Next I will look into making an example for training a baby llama, or a small LoRA finetune on some late layer. I may well find some still-undiscovered issues during this^^

@ggerganov
Owner

Maybe this operation could get another name like ACC_VIEW?

I guess GGML_OP_ACC implies we are accumulating into src0, so maybe go with that

Maybe this operation could get a more meaningful name, something like ADD_ROWS_TO, or ACC_ROWS_TO?

No strong preference

Maybe the mem pattern in the for loop for (int j = ith; j < n; j += nth) was important to keep?

Don't think it is important. At some point I was thinking that one of the methods is better than the other, but I think in the end they give pretty much the same performance.

Amazing work!
Do you know how to train models?
It will be super interesting to see what you do, especially if you can make it work with ggml.

@xaedes

xaedes commented May 1, 2023

Do you know how to train models?

Got a baby llama model trained from scratch to output sin signal:
https://github.com/xaedes/llama.cpp/commits/train-example

After training with one call to ggml_opt with default ADAM settings and only one example, its output is better, but still not really good; of course, it has only seen one example^^.
But it shows that the whole pipeline with respect to gradients should work.

@ggerganov
Owner

Ha, I didn't realize we can simply train with mathematical functions. Was always thinking we need to get some text data into this.

Ok, I understand the idea - the cost function is F=sum((logits - expected)^2) and we optimize F with respect to all weights. Very cool stuff!

@xaedes

xaedes commented May 1, 2023

Training directly with sin, etc., given the required ggml operations, would also be possible, but here I just tokenized the sinus output float in [-1.0,+1.0] to a token id in [0..n_vocab-1].
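
A hedged sketch (plain Python, not the actual code) of that kind of tokenization: a float in [-1.0, +1.0] is quantized to a token id in [0..n_vocab-1], and a token id maps back to a representative float. The bin count and rounding are illustrative assumptions.

import math

n_vocab = 32

def tokenize(x):                       # x in [-1, 1] -> id in [0, n_vocab-1]
    return min(n_vocab - 1, int((x + 1.0) / 2.0 * n_vocab))

def detokenize(token_id):              # id -> representative float
    return (token_id + 0.5) / n_vocab * 2.0 - 1.0

xs = [math.sin(0.2 * t) for t in range(8)]
print([tokenize(x) for x in xs])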

The cost function I used in this first attempt is probably not good; I just took what I used in test_opt.c for a first test^^
I should probably look up which cost function is usually used, something like cross entropy?

@ggerganov
Owner

I should probably look up which cost function is usually used, something like cross entropy?

Not sure what is normally used in practice.
At the very least, the sum of squared differences could probably benefit from some sort of regularization of the weights.
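
For reference, a minimal sketch (NumPy) of cross-entropy on the softmaxed logits, computed via a numerically stable log-softmax; this is the usual loss for next-token prediction, as opposed to the sum of squared differences used above.

import numpy as np

def cross_entropy(logits, target_id):
    z = logits - logits.max()                    # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())      # log softmax
    return -log_probs[target_id]

logits = np.array([2.0, -1.0, 0.5])
print(cross_entropy(logits, 0))   # small loss: class 0 is already likely
print(cross_entropy(logits, 1))   # larger loss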

@xaedes

xaedes commented May 1, 2023

Ohh, switching from adam to lbfgs produces MUCH better results!

best samples before optimization:

   X
         X
        X
      X
          X
        X
 X
   X
               X
      X
    X
               X
     X
           X
              X
               X
  X
               X
        X

best samples after optimization:

       X
          X
            X
              X
               X
              X
            X
          X
       X
    X
  X
X
X
X

When optimized with adam, best samples after optimization:

       X
    X
       X
              X
         X
          X
X
  X
         X
   X
  X
     X
               X
     X
        X
        X
        X
      X
    X
            X

@ggerganov
Owner

Yeah, I have always wondered why ADAM is considered state-of-the-art

@xaedes

xaedes commented May 1, 2023

Maybe there lingers a bug in opt_adam somewhere? Anyway, I'm just sticking with lbfgs for now; that sinus looked really good :)
Once I figure out how to train it on more than one example I can see how to get LoRA finetuning to work.

@loretoparisi

Yeah, I have always wondered why ADAM is considered state-of-the-art

Adam or AdamW? The latter should be preferred...

@Alasdair0

Alasdair0 commented May 5, 2023

@xaedes how's your progress on the training? Is it ready for some tests? I have datasets and hardware sitting around, I'd be happy to take it for a test drive and deliver some stats on the performance.

Also, you know, assuming you can train it for anything, it might be a more interesting development than people realize. If you can finetune on CPU at some kind of reasonable speed it means you can use the technology in a different context. For example, you could have an application built with a thread running that is continuously, incrementally training on whatever the user adds to it, and also adding their previous conversations with it, timestamped. That would mean the user could add a GitHub repository that would be ingested, add their own code, or they could say "hey, remember yesterday we were talking about xyz, I just had a thought..." It would be a breakthrough.

GPU training has to be configured for the hardware, but CPU training can be run on anything in the background. No need to install dependencies, no need to overheat the room. You could use it in an app and train on transcripts of phone calls, so the user could ask about previous conversations they've had: "what time did Sarah say we were meeting?". Gamechanger.

I also suspect we're going to see next generation CPUs bridging the gap even more.

@xaedes

xaedes commented May 7, 2023

@Alasdair0

Found some bugs along the way that needed some time to fix...^^ In the first tests the gradient did not actually get propagated to all model parameters. At first I also trained it to predict the current token instead of the next token and wondered for quite some time why it would only generate flat lines, despite hitting the target logits very well during training.

Now a from-scratch llama generating an endless sinus wave works correctly :)

https://github.com/xaedes/llama.cpp/commits/train-example

Training on multiple examples now also works. Just calling ggml_opt with a low max-iterations setting in a loop and properly cleaning up the tensors created per loop was enough. But it generates some unnecessary overhead by recreating the whole forward and backward computation graphs each time. With more refactoring of ggml_opt we could just reuse the forward and backward computation graphs and the optimizer state. I experimented a bit with it but decided against using it for now, because I did not want to touch the ggml_opt functions unless absolutely necessary.

A parallel batched forward function would probably be a good improvement. Training on multiple examples in a (parallel) batch really seems to improve the training, but currently I can only do that by calling the forward function multiple times with different input data, which costs a lot of nodes in the computation graph, especially since the backward pass is necessary as well.

Changing the target logits from 0 & 1 to -1 & +1 greatly improved the training.

I tried cross-entropy loss on the softmaxed probabilities instead of sum of squared logit errors, but it was consistently worse.

I did not look into training a LoRa finetune yet, but the necessary machinery for that seems to be working.

@ggerganov

Maybe this operation could get another name like ACC_VIEW?

I guess GGML_OP_ACC implies we are accumulating into src0, so maybe go with that

Ok, then I'll change it to use that name and prepare a pull request with the training-from-scratch example before I get lost any longer on the LoRA finetune; that can come next.

Just to make sure there is no misunderstanding about what the GGML_OP_ACC function does: the corresponding function signature currently looks like below.
An important part of the function is that it can apply a view with nb1, etc., so we can add at a specific position (and with specific strides).

GGML_API struct ggml_tensor * ggml_add_at(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        size_t                nb1,
        size_t                nb2,
        size_t                nb3,
        size_t                offset);
// dst = a  
// view(dst, nb1, nb2, nb3, offset) += b
// return dst
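
A hedged NumPy sketch of those semantics: the result keeps the shape of a, and b is added into a view of a selected by the offset (here counted in elements for simplicity; ggml uses byte offsets and the strides nb1/nb2/nb3).

import numpy as np

def add_at(a, b, offset):
    dst = a.copy()                                   # dst = a
    flat = dst.reshape(-1)                           # view(dst, ..., offset)
    flat[offset:offset + b.size] += b.reshape(-1)    # view += b
    return dst                                       # same shape as a

a = np.zeros((2, 4))
b = np.ones(3)
print(add_at(a, b, offset=2))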

@Alasdair0

Alasdair0 commented May 8, 2023

@xaedes you are a hero!

I'm going to address some of those things you mentioned, but I might just be highlighting my ignorance, because while I have a lot of experience with old-school classification, I've not been working for a few years and am only just updating myself with how transformers work.

(1) Recreating the computation graphs has always seemed inefficient to me, but it does the same thing during inference, no? If so, a solution here could be a breakthrough for faster inference. It would have to be a perfect reproduction though, not an approximation, because otherwise you would not be training for the same result (although it's possible that it only needs to be approximate.)

It's always struck me as odd that the full computation is done for every token, again and again. Especially because it's obvious that if the model can write "and" as the next word to anything at all, it already calculated in some sense what was going to come next. That's just wasted the way these models currently work. Some people say it's just a next token predictor, but after my struggle to understand how it works, I see now that that's not true - it does in some sense, understand.

(2) Training on multiple examples, there should be a sweet spot in theory. However, when you're doing this on natural language you always have a batch because there's a ton of tokens to predict for every one new "document" trained on. BTW, for ingesting a document every word is masked except the first, and for instruct or finetuning generally the input/question is not masked and the output/answer is masked.
Example for training a document
[1 5 9 2] would mean [1] targeting [5], then [1 5] targeting [9], then [1 5 9] targeting [2].
Whereas for instruct tuning the user would pass 2 vectors of:
[1 5] [9 2] -> [1 5] targeting [9], then [1 5 9] targeting [2]
I know you know this already; I'm trying to point out that there needs to be a neat way for it to be trained with or without masking. In terms of programming efficiency it would make sense that they're on the same vector and the index of the beginning of the masked section is passed as a parameter, which is set to 1 if not provided (to mask all tokens except the first); see the sketch after this point.

There is also the question of left vs right hand padding, and the padding token. The padding token for LLaMa is 0, as can be seen here. However, it doesn't have a default left or right hand padding because there is no padding in the base model, every batch was always 2048 length. In the finetuning implementations online, people are using left or right, there is no standard. I recommend left padding because it ensures that the attention mechanism can focus on the actual text without being affected by padding tokens that have no meaningful content. In contrast, right padding could potentially introduce noise in the attention scores, as the model needs to learn to ignore the padding tokens.
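
A hedged sketch (plain PyTorch, not this repo's code) of the masking scheme described above: inputs predict the next token, a loss mask selects which targets count, and sequences are left-padded up to the context length. make_example is a hypothetical helper for illustration only.

import torch

def make_example(tokens, mask_from=1, pad_id=0, ctx_len=8):
    inputs, targets = tokens[:-1], tokens[1:]             # input[t] predicts tokens[t+1]
    loss_mask = [1 if t + 1 >= mask_from else 0 for t in range(len(targets))]
    pad = ctx_len - len(inputs)                           # left padding
    return (torch.tensor([pad_id] * pad + inputs),
            torch.tensor([pad_id] * pad + targets),
            torch.tensor([0] * pad + loss_mask))

# document-style: every token after the first is a target
print(make_example([1, 5, 9, 2], mask_from=1))
# instruct-style: only the answer tokens [9, 2] are targets
print(make_example([1, 5, 9, 2], mask_from=2))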

(3) Changing the logits target from 0 & 1 to -1 & 1: in theory that should make no difference, since it's immediately softmaxed after this, no? If it makes a difference then I would first ask how many bits you're storing it with. Potentially the reason it's helping is that you're trying to predict a sin wave, so any lossiness is going to normalize -1 & 1 towards 0, which is better for your sin wave (it's because you have an even distribution, whereas you would not have an even distribution with natural language.) Whatever the reason, it's an indication of a problem elsewhere.

(4) Cross-entropy vs sum of squared logits. Be careful here, because you're training a text-based sin wave generator, right? Cross-entropy is recommended for transformers and natural language, but your test is not doing that. Your sin wave test can't really be used as an example of what natural language looks like, so I wouldn't try optimizing for it. I can provide you with datasets if you need.

(5) LoRA is the future for sure, because it allows incremental finetunes over and over. But yeah, gotta get it working first, it's already impressive!

(6) Are you using the SELU activation function? That's what's recommended for LLaMa. Also look into flash attention; I've talked to a few people about this and it's considered the best attention mechanism for LLaMa by a country mile. I understand @ggerganov tested this for inference and found it not really any better, but for training at least I've anecdotally heard 20x speed and 90% memory improvement. More specifically, it reduces the memory requirements for longer context lengths to O(n).

@xaedes

xaedes commented May 12, 2023

@Alasdair0

It's always struck me as odd that the full computation is done for every token, again and again.

At least with the kv_cache a lot of computation can be avoided during inference.
During training with whole new samples this doesn't help much, though.

Where as for instruct tuning the user would pass 2 vectors of: [1 5] [9 2] -> [1 5] targeting [9]

For this case the training should be able to make use of the kv_cache with n_past = 2.

Your notes regarding the padding are interesting; I will keep them in mind when working on actual LoRA finetunes of the models!

After some further tests, having other issues resolved and now with parallel batched training, I find that adam and cross entropy work as well as, if not sometimes better than, lbfgs and the squared error sum.

As adam provides an easier parameter for a learning schedule (still todo, some very first tests with exp-decay were meh) I will probably focus more on that.

@loretoparisi suggested that AdamW should be preferred. I don't know which one we use, probably the one without W, but after a short skim it seems to just compute a different scalar somewhere. It would make sense to try implementing and testing it for training if we are not already using it.
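
For reference, a hedged NumPy sketch of the difference, assuming the usual formulations: AdamW applies the weight decay directly to the weights ("decoupled"), while plain Adam with L2 regularization folds the decay into the gradient.

import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g              # first moment
    v = b2 * v + (1 - b2) * g * g          # second moment
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam update
    w = w - lr * wd * w                    # AdamW: decoupled weight decay
    # plain Adam + L2 would instead have used g + wd * w as the gradient
    return w, m, v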

Do you mean SwiGLU rather than SELU? SwiGLU, which internally uses SILU, is used in the paper and the official llama inference code, and is also used in llama.cpp.
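
For context, a minimal NumPy sketch of the SwiGLU feed-forward block as used in LLaMA: silu(x @ W1) gated by (x @ W3) and projected back with W2. The weight names are illustrative, not the exact tensor names in llama.cpp.

import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w1, w2, w3):
    return (silu(x @ w1) * (x @ w3)) @ w2

n_embd, n_ff = 8, 32
x = np.random.randn(4, n_embd)
w1, w3 = np.random.randn(n_embd, n_ff), np.random.randn(n_embd, n_ff)
w2 = np.random.randn(n_ff, n_embd)
print(swiglu_ffn(x, w1, w2, w3).shape)   # (4, 8)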

Flash attention could be really interesting when it helps improve training performance that much! The ggml forward pass implementation looks way less intimidating than I remember it from the paper. Maybe I should look into implementing the backward pass for that as well some time.

@ggerganov
Owner

Here is the flash attention that I've tried without gaining any performance: ggerganov/llama.cpp#778

As a side note, today I was intrigued by the "multi-query" attention paper that uses n_head times less KV cache memory: https://arxiv.org/pdf/1911.02150.pdf . If we start training baby LLaMAs, we might want to consider this :)

@loretoparisi

Here is the flash attention that I've tried without gaining any performance: ggerganov/llama.cpp#778

As a side note, today I was intrigued by the "multi-query" attention paper that uses n_head times less KV cache memory: https://arxiv.org/pdf/1911.02150.pdf . If we start training baby LLaMAs, we might want to consider this :)

Hahaha, me too, I was wondering if multi-query attention was possible!
Here is a PyTorch implementation I found; I didn't try it, btw.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads,  dropout, device):
        super().__init__()
        
        assert hid_dim % n_heads == 0
        
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = self.hid_dim // self.n_heads

        self.fc_q = nn.Linear( self.hid_dim, self.hid_dim)
        self.fc_k = nn.Linear( self.hid_dim, self.head_dim)
        self.fc_v = nn.Linear(self.hid_dim, self.head_dim)  
        self.fc_o = nn.Linear(self.hid_dim, self.hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
        
    def forward(self, query, key, value, mask = None):
        
        batch_size = query.shape[0]
        
        #query = [batch size, query len, hid dim]
        #key = [batch size, key len, hid dim]
        #value = [batch size, value len, hid dim]
               
        Qbank = self.fc_q(query).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        Kbank = self.fc_k(key).view(batch_size, -1, 1, self.head_dim).permute(0, 2, 3, 1)
        Vbank = self.fc_v(value).view(batch_size, -1, 1, self.head_dim).permute(0, 2, 1, 3)   
        
        #Qbank = [batch size, n heads, query len, head dim]
        #Kbank = [batch size, 1, head dim, key len]
        #Vbank = [batch size, 1, value len, head dim]

        energy = torch.matmul(Qbank, Kbank) / self.scale

        #energy = [batch size, n heads, query len, key len]
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = F.softmax(energy, dim = -1)
                
        #attention = [batch size, n heads, query len, key len]

        x = torch.matmul(self.dropout(attention), Vbank)

        x = x.permute(0, 2, 1, 3).contiguous()

        x = x.view(batch_size, -1, self.hid_dim)
        
        #x = [batch size, seq len, hid dim]
        
        x = self.fc_o(x)
        
        return x, attention

@loretoparisi

Oh wow! Interestingly, there is a more recent Multi-Query Attention implementation by the MosaicML team for MPT-7B here. I did not know they were actually using Multi-Query Attention for the MPT models, did you?

import math
import warnings
from typing import Optional

import torch
import torch.nn as nn

# Note: LPLayerNorm, flash_attn_fn, triton_flash_attn_fn and
# scaled_multihead_dot_product_attention are defined elsewhere in the
# MPT attention module this class was taken from.

class MultiQueryAttention(nn.Module):
    """Multi-Query self attention.

    Using torch or triton attention implemetation enables user to also use
    additive bias.
    """

    def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, low_precision_layernorm: bool=False, device: Optional[str]=None):
        super().__init__()
        self.attn_impl = attn_impl
        self.clip_qkv = clip_qkv
        self.qk_ln = qk_ln
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.softmax_scale = softmax_scale
        if self.softmax_scale is None:
            self.softmax_scale = 1 / math.sqrt(self.head_dim)
        self.attn_dropout_p = attn_pdrop
        self.Wqkv = nn.Linear(d_model, d_model + 2 * self.head_dim, device=device)
        fuse_splits = (d_model, d_model + self.head_dim)
        self.Wqkv._fused = (0, fuse_splits)
        if self.qk_ln:
            layernorm_class = LPLayerNorm if low_precision_layernorm else nn.LayerNorm
            self.q_ln = layernorm_class(d_model, device=device)
            self.k_ln = layernorm_class(self.head_dim, device=device)
        if self.attn_impl == 'flash':
            self.attn_fn = flash_attn_fn
        elif self.attn_impl == 'triton':
            self.attn_fn = triton_flash_attn_fn
            warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
        elif self.attn_impl == 'torch':
            self.attn_fn = scaled_multihead_dot_product_attention
            if torch.cuda.is_available():
                warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
        else:
            raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
        self.out_proj = nn.Linear(self.d_model, self.d_model, device=device)
        self.out_proj._is_residual = True

    def forward(self, x, past_key_value=None, attn_bias=None, attention_mask=None, is_causal=True, needs_weights=False):
        qkv = self.Wqkv(x)
        if self.clip_qkv:
            qkv.clamp_(min=-self.clip_qkv, max=self.clip_qkv)
        (query, key, value) = qkv.split([self.d_model, self.head_dim, self.head_dim], dim=2)
        key_padding_mask = attention_mask
        if self.qk_ln:
            dtype = query.dtype
            query = self.q_ln(query).to(dtype)
            key = self.k_ln(key).to(dtype)
        if past_key_value is not None:
            if len(past_key_value) != 0:
                key = torch.cat([past_key_value[0], key], dim=1)
                value = torch.cat([past_key_value[1], value], dim=1)
            past_key_value = (key, value)
        if attn_bias is not None:
            attn_bias = attn_bias[:, :, -query.size(1):, -key.size(1):]
        (context, attn_weights) = self.attn_fn(query, key, value, self.n_heads, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights, multiquery=True)
        return (self.out_proj(context), attn_weights, past_key_value)

@alasdairforsythe

@xaedes
Same guy, different username (long story.)

I see the baby-llama example has been merged into the master branch and draws a pretty sin wave. What's the intention for this going forward? Is it just a proof of concept and you're happy, or do you intend to expand it to the point of realistically being able to train an LLM from scratch?

I've been working on an optimal tokenizer (I've just completed an ungreedy version and will be putting that up in a few days once the vocabs are built) and a text normalizer to support it. The idea of a CPU-based trainer that I could run from scratch on my own tokenizer is appealing to me. My goal is to have training running 24/7 on a server, just slowly learning forever, whilst saving its state out regularly. How far away are we from that dream?

@xaedes

xaedes commented May 18, 2023

@alasdairforsythe I am still working on the text training.

https://github.com/xaedes/llama.cpp/tree/text-from-scratch

To make it work at all with a 32001-sized vocabulary I had to improve training performance quite a bit by replacing some slow functions and avoiding the need to create huge intermediate matrices.

Got a small example working to train a small model from scratch on some text. It works okay, but I'm still not really happy with the cross entropy loss function. It often lands in bad local minima that it does not really get out of. But I recently found a bug in my cross entropy loss function; fixing it may improve that. Squared error converges a lot faster, but I suspect it also overfits greatly to the given examples, because it essentially trains directly towards a specific target probability distribution (what I define in the examples) instead of slowly training towards the distribution of the actual whole dataset.

Training a real-sized llama from scratch with a 4M batch size like they did in the original paper will probably require lots of memory and runtime. Batch size > 1 seems to be absolutely necessary for good training, but it multiplies the memory and runtime.
Maybe someone knows a good trick that can be used to easily manage 4M?
Some kind of gradient accumulation is probably necessary to support such huge batch sizes; I'd need to touch the optimizers for that. When I look into that I might also take a look at AdamW.
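
A hedged sketch (plain PyTorch, not ggml) of gradient accumulation: run several micro-batches, let their gradients sum up, and take a single optimizer step, so a large effective batch size does not need to fit in memory at once.

import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 8                                    # effective batch = 8 micro-batches

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(4, 16)                         # one micro-batch
    y = torch.randint(0, 4, (4,))
    loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                                # gradients accumulate into .grad
opt.step()
opt.zero_grad()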

But I might look into LoRA finetuning first before further pursuing the batch size issue, because for finetuning smaller batch sizes are supposed to be ok.

The other suggestion of multi-query attention sounds very interesting for training from scratch, might look into that, doesn't look too hard to implement and test.

@Green-Sky
Contributor

@xaedes you should initialize the parameters of the model to small values, to converge faster. I know you use a normal distribution, which lands them more in the small-value range, but most models use a range of -0.02 to 0.02, or something similar.

randomize_model(&model, 1337, 0.0f, 1.0f, -1.0f, +1.0f);

eg: https://huggingface.co/openlm-research/open_llama_7b_400bt_preview/blob/02302cca2ce4f07a32e56c8ec91591d35445b16f/config.json#L9

Often lands in bad local minima where it does not really go out of.

maybe tweak learn rate or something?

@alasdairforsythe

@xaedes

Regarding batch size: usually when these are trained they're using multiple GPUs so each "batch size" is broken down into multiple micro batches, each GPU processes one micro batch and then these are merged into the single batch. But an advantage of doing it on CPU is using system RAM, which I assume you have a lot of? I can lend you a 256GB RAM server for testing if you need.

From my understanding it's the activations/intermediates that increase the memory usage at higher batch sizes. Is that right? You only need to have enough memory for 2 sets of upstream gradients at any point in time, and the rest of it is the fixed size of the model.

Aren't the rules different because you're doing this on CPU? A significant bottleneck for the GPU is transferring the data from system memory to GPU memory, but you're not using a GPU, and you're not multiplying all those matrices in parallel. So technically they don't need to all be in memory at the same time. IO may be fast enough to load them in and out, seeing as your operations are largely sequential? Like I said, I'm just learning, so I may be talking rubbish. But to state the obvious: what stops you from saving these to disk and then later loading them back again? Or at the very least you could load in the activations for one layer at a time, since you only need the activations for each layer one at a time.

You could memory map it. Or, seeing as you know ahead of time how much you need to load in and when, you might be able to do better than memory mapping. You could read an array of them statically cast directly into a memory location that is already defined as whatever the struct is. One thread could be loading in the next batch of intermediates whilst you're "processing" this one. Bottleneck might not even be IO, but if it is, it's not necessarily worse than making a sacrifice somewhere else. If the IO is the bottleneck, gradient checkpointing reduces memory significantly by recalculating some of those values, which would be exactly what you want in that context. You could even have it as an option, depending on the IO speed.

I'd also say it's not unreasonable to "expect" NVMe SSDs if there is not enough RAM. It is 2023.

If I'm out of my depth and wasting your time, just me let me know.

@alasdairforsythe

@xaedes I've been pondering this problem whilst attempting to understand more about the problem, and I've come up with the following:

  1. You can recalculate portions of the intermediary calculations instead of saving them. However, I think that it's probably not worth it because CPU is already much slower.
  2. You can attempt some kind of lossy compression and reduce the dimensionality at the cost of accuracy. But again, it's just CPU in exchange for memory, and probably less effective than (1) because you do already have the formula for regenerating it losslessly, so why spend computational resources compressing it when it can just be recreated?
  3. You can save it to disk, as I previously suggested. I just read that this is what Flash Attention does, except between GPU HBM & SRAM. In this case it would be between main memory and HDD.
  4. You can attempt to train only parts of it at a time. But as I imagine this, it's in effect the same thing as training a smaller model, and then training those models to work together, which seems like it defeats the point. If that were the intention, then better to literally just do that.
  5. You could change the backpropagation to not require the intermediate calculations. There are a few ideas I have regarding this, but all of them are terrible because they would massively increase training time, and it would be faster just to recalculate the real values.

That's essentially it, right? Data can be compressed, reduced, recreated, stored or not needed. What else can you do with it?

Given that IO is the unused factor so far, that seems to be where there are obvious "free" gains. To do it well I would suspect means building a little memory-management system. If the structure were arranged so that data that are used together are stored together in memory, they could be easily written out and read in asynchronously. There could be a defined number of these buffers that contain the working data.

You know exactly how much data you need to store that must be accessed at the same time, so from that you can determine the correct size for a buffer. And since you know also the total memory requirements for the model size, etc. and you know dynamically how much memory is available on the machine, it's easy to dynamically calculate the number of these "buffers" that would be used, based on those figures. It means that there can be a user-defined peak memory usage, from which you calculate the number of these buffers.

On the forward pass you would fill a buffer and send the pointer & identifier to the memory manager, who would asynchronously save it and send you back a pointer to a free buffer, or return a nil pointer if none are available (at which point you could either wait or do something else.) And during backpropagation you send a pointer of the buffer you want to "free" and it sends you either a pointer to the next loaded data, or if it's still reading it in, then a nil pointer, at which point you can revert to do something else, such as recreating the data if that's practical.

If the struct is trivial and it begins in an aligned position, you can write it without any additional buffer or serialization by statically casting it to an array of bytes, and vice-versa to get it back. If IO speed is the bottleneck, there are specialized compression formats, such as Snappy, which was designed for live compression and decompression for the purpose of reducing IO bottleneck. You could in fact use Snappy, as a CPU/IO tradeoff, enabling it automatically if the memory manager detects that IO is the bottleneck, which is determined by counting the number of times you try to write a buffer when all the buffers are still being written, or read a buffer when they're still being read. If at least 20% of the requests are met with a failure (no buffer available), switch that boolean flag and from then on use Snappy (or perhaps a compression format better suited for floats) to compress and decompress the data, which will probably halve the IO in exchange for greater CPU load, which would be exactly what you want in that circumstance.

@xaedes

xaedes commented May 19, 2023

@Green-Sky Good point. I divided by the square root of the number of dimensions, as suggested elsewhere, and it really helped improve initial convergence.
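
A minimal NumPy illustration of that change (the numbers are just an example): scaling the random init by 1/sqrt(n_embd), or using a small fixed std like 0.02, keeps the initial weights and activations small.

import numpy as np

n_embd = 256
w_unscaled = np.random.normal(0.0, 1.0, (n_embd, n_embd))
w_scaled   = np.random.normal(0.0, 1.0 / np.sqrt(n_embd), (n_embd, n_embd))
print(w_unscaled.std(), w_scaled.std())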

@alasdairforsythe Thanks for your input. I think using memmap files as a backend for ggml contexts with large memory requirements would give a lot of the features that you described. So maybe we should try that at some point. Maybe gradient accumulation, by just looping over different data and summing the gradients used for the optimizer, is faster when it can avoid the swapping.
Implementing cross entropy and some backward passes directly as ggml operations could also save some more memory.
I could imagine a compilation pass over the computation graph changing data pointers to reuse some memory, not sure how much could really be saved by it - we already have some inplace operations after all. But if there is some potential it could be interesting to explore in the future.

I fixed the cross entropy loss function and now it works as it should. Overall I am pretty happy with the current state, it actually learns to generate plausible text.

Trained on genesis 1 for 64x16 iterations (256 n_emb, 4 n_layer, 32 n_ctx, 16 n_batch):

Then God said, "Let us make man in--- [generated output follows]
image, after our likeness. And let them have dominion over the fish of the sea and over the birds of the heavens and over the livestock and over every living thing that creeps on the earth." And God said, I have given you every plant yielding seed that is on the earth." And there was good. And there was evening and there was morning, the fourth day.

And God said, "Let the waters swarm with swarms of living creatures, and let birds fly above the earth sprout vegetation, and every living creatures, and every living creatures, and every be lights in the expanse of the earth according to their kinds, and every living thing that moves on the earth." And God said, "Be fruitful and the earth." And God saw that it was good. And there was evening and there was morning, the fourth day.

Will soon make a llama pull request with an example of how to train a small llama-compatible (i.e. loadable by main) model from scratch on custom text data.
Before that I want to finish some more performance optimizations by implementing operations directly.
Checkpoints need to be exported to llama-compatible files, as I used a simplified file structure for checkpoints to avoid excessive code copying from llama. I still need to add such an export to the example to make it usable in main. Hopefully llama does not require too much modification for this; I didn't look too hard into it yet. Code related to enum e_model probably needs some changes.
The example itself also needs some polishing, like reading currently hardcoded parameters from command line arguments, etc.

After the pull request I will continue by experimenting with LoRa finetuning, multi-query attention, flash attention, gradient accumulation & memmap based ctx to train with larger batch sizes.

@ggerganov
Owner

@xaedes

Trained on genesis 1 for 64x16 iterations (256 n_emb, 4 n_layer, 32 n_ctx, 16 n_batch):

One possible application of these "baby" LLaMA models is for "Speculative sampling":

ggerganov/llama.cpp#630 (comment)

A paper claims about 2x faster inference can be achieved with such an approach: https://arxiv.org/abs/2302.01318
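
A hedged sketch (plain Python) of a simplified, greedy form of speculative sampling: a small draft model proposes k tokens, the large target model checks them in one batched pass, and tokens are accepted up to the first mismatch. draft_next and target_argmax_batch are hypothetical stand-ins for real model calls; the full method in the paper uses rejection sampling on the probabilities instead of exact matching.

def speculative_step(prompt, draft_next, target_argmax_batch, k=4):
    ctx, draft = list(prompt), []
    for _ in range(k):
        t = draft_next(ctx)                   # cheap model proposes one token
        draft.append(t)
        ctx.append(t)
    # big model's prediction at each of the k proposed positions, in one pass
    verified = target_argmax_batch(prompt, draft)
    accepted = []
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            accepted.append(checked)          # take the big model's token and stop
            break
        accepted.append(proposed)
    return list(prompt) + accepted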

@loretoparisi

@xaedes

Trained on genesis 1 for 64x16 iterations (256 n_emb, 4 n_layer, 32 n_ctx, 16 n_batch):

One possible application of these "baby" LLaMA models is for "Speculative sampling":

ggerganov/llama.cpp#630 (comment)

A paper claims about 2x faster inference can be achieved with such an approach: https://arxiv.org/abs/2302.01318

I'm looking at the code right now! The author of picoGPT (GPT-2 inference in pure NumPy) implemented speculative sampling in Python and tested it on GPT-2, achieving a 2x speed-up:
https://github.com/jaymody/speculative-sampling

@loretoparisi

It's worth noting that Sophia could be a valid alternative to the AdamW optimizer; code is now available:

https://github.com/Liuhong99/Sophia

@ggerganov
Owner

I could imagine a compilation pass over the computation graph changing data pointers to reuse some memory, not sure how much could really be saved by it - we already have some inplace operations after all. But if there is some potential it could be interesting to explore in the future.

The "scratch buffer" mechanism is something in this direction. It's not ideal and can be improved in many ways.
Here are a couple of links to demonstrate its current usage:

@matthiasgeihs

Hey @xaedes, how is it going with Lora fine-tuning? Would be so cool to have this. Thanks for the great work!

@DamascusGit

Also curious for an update on this; it's still extremely relevant for so many people to have LoRA/QLoRA support on Metal.

@xaedes

xaedes commented Jul 28, 2023

Hi there, sorry for the long wait! I was on vacation for a few weeks and am now back working on this :)

Memory usage improvements (mainly gradient checkpointing & opt-adam improvements) for training are done, and I will now start to make a pull request on the llama repo - lots of changes from master to merge...

Development of LoRA finetuning will then start based on this.

@xaedes

xaedes commented Aug 18, 2023

LoRA finetuning of a 3B model seems to mostly work now:
ggerganov/llama.cpp#2632

Bigger models probably work as well; they just need more RAM.

@webpolis

webpolis commented Sep 23, 2023

@xaedes is there any way today to use ggml with a quantized GPT-J model (using ggml) and a LoRA adapter trained using Hugging Face?

@saraalrawi

saraalrawi commented Feb 21, 2024

@xaedes thanks a lot for the great work!
I have been trying to train an Autoencoder/Variational Autoencoder for dimensionality reduction.
I have been struggling with the inference.

The question is: do I need to build an inference graph, which is basically just an encoder -> latent space, without re-initializing the model, such that I use the model's weights and biases?
Or how can I do inference?

I am asking because it is unclear to me how the inference is done in your code.

thanks
