
Llama 3.1 #38

Merged
merged 15 commits into from
Sep 3, 2024
Conversation

laggui
Member

@laggui laggui commented Aug 22, 2024

Ported the new Llama weights to our implementation along with their custom RoPE frequency scaling.
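For context, Llama 3.1's RoPE frequency scaling leaves high-frequency components untouched, divides low-frequency components by a scale factor, and smoothly interpolates in between. The sketch below is illustrative, not the PR's actual code; the constants (scale factor 8, low/high frequency factors 1 and 4, original context length 8192) are the values from Meta's reference implementation, and the function name is hypothetical.

```rust
// Hedged sketch of Llama 3.1-style RoPE frequency scaling.
// Constants assumed from Meta's reference implementation; the
// function name `scale_rope_frequency` is illustrative only.
fn scale_rope_frequency(freq: f32) -> f32 {
    let scale_factor = 8.0_f32;
    let low_freq_factor = 1.0_f32;
    let high_freq_factor = 4.0_f32;
    let old_context_len = 8192.0_f32; // original pre-training context

    let wavelen = 2.0 * std::f32::consts::PI / freq;
    let low_freq_wavelen = old_context_len / low_freq_factor;
    let high_freq_wavelen = old_context_len / high_freq_factor;

    if wavelen < high_freq_wavelen {
        // High-frequency components pass through unchanged.
        freq
    } else if wavelen > low_freq_wavelen {
        // Low-frequency components are scaled down.
        freq / scale_factor
    } else {
        // Smooth interpolation between the two regimes.
        let smooth = (old_context_len / wavelen - low_freq_factor)
            / (high_freq_factor - low_freq_factor);
        (1.0 - smooth) * freq / scale_factor + smooth * freq
    }
}

fn main() {
    // A high frequency (short wavelength) is left unchanged.
    assert_eq!(scale_rope_frequency(1.0), 1.0);
    // A very low frequency is divided by the scale factor.
    assert!((scale_rope_frequency(1e-5) - 1e-5_f32 / 8.0).abs() < 1e-10);
    println!("ok");
}
```

In practice this scaling is applied once to the base inverse frequencies before the usual rotary embedding is computed, which is what lets 3.1 extend the context window while reusing the 3.0 architecture.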

Made a couple of additional changes to fix minor things along the way:

  • Changed the default Llama 3 model for the chat example to 3.1
    • Added a command-line argument to select between the versions
  • Added a multiple stop tokens condition to properly support both end-of-turn and end-of-text for the newer Llama models
  • Fixed max_seq_len so the command-line argument value is properly propagated (it was previously unused)
  • Added an import feature flag to keep the PyTorch weights loading code optional
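The multiple-stop-token condition amounts to halting generation as soon as the sampled token id is any member of a configured set rather than a single id. A minimal sketch, assuming placeholder function names; the ids shown are the Llama 3 tokenizer's <|end_of_text|> (128001) and <|eot_id|> (128009), used here only for illustration:

```rust
use std::collections::HashSet;

// Hedged sketch of a multiple-stop-token check: generation stops when
// the sampled token id matches ANY configured stop token, so both
// end-of-turn and end-of-text terminate the sequence.
fn take_until_stop(tokens: &[u32], stop_tokens: &HashSet<u32>) -> Vec<u32> {
    let mut out = Vec::new();
    for &tok in tokens {
        if stop_tokens.contains(&tok) {
            break; // end of turn or end of text reached
        }
        out.push(tok);
    }
    out
}

fn main() {
    // Both stop ids are checked; hitting either one ends generation.
    let stops: HashSet<u32> = [128001, 128009].into_iter().collect();
    let generated = take_until_stop(&[10, 20, 128009, 30], &stops);
    assert_eq!(generated, vec![10, 20]);
    println!("ok");
}
```

Using a set keeps the check O(1) per token regardless of how many stop tokens a model family defines.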

TODO:

  • Use Burn version 0.14

@laggui laggui marked this pull request as ready for review August 29, 2024 16:43
@laggui
Member Author

laggui commented Aug 30, 2024

Some notes:

The wgpu backend fails due to a lack of memory (even with TinyLlama):

// WSL
Loading record...
thread 'main' panicked at /home/laggui/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cubecl-runtime-0.2.0/src/memory_management/dynamic.rs:156:9:
No memory pool big enough to reserve 262144000 bytes.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
// Windows
Loading record...
thread 'main' panicked at C:\Users\guila\.cargo\registry\src\index.crates.io-6f17d22bba15001f\wgpu-22.1.0\src\backend\wgpu_core.rs:3411:5:
wgpu error: Validation Error

Caused by:
  In Device::create_buffer
    Not enough memory left.


note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: process didn't exit successfully: `target\release\examples\chat.exe` (exit code: 101)

The CUDA backend currently uses f32 because f16 hits some compilation errors:

Loading record...
Loaded in 1s
Processing prompt: How many helicopters can a human eat in one sitting?
thread 'main' panicked at /home/laggui/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cubecl-cuda-0.2.0/src/compute/server.rs:237:17:
[Compilation Error]
    default_program(56): error: class "__half2" has no member "i_0"
      l_0_8.i_0 = __half(0.0);
            ^
    default_program(57): error: class "__half2" has no member "i_1"
      l_0_8.i_1 = __half(0.0);
            ^
    default_program(66): error: class "__half2" has no member "i_0"
      l_0_9.i_0 = __half(0.0);
            ^
    default_program(67): error: class "__half2" has no member "i_1"
      l_0_9.i_1 = __half(0.0);
            ^
    4 errors detected in the compilation of "default_program".
[Source]
#include <cuda_fp16.h>
typedef unsigned int uint;


extern "C" __global__ void kernel(
__half2 input_0[],__half2 input_1[],uint info[]
) {

    int3 absoluteIdx = make_int3(
        blockIdx.x * blockDim.x + threadIdx.x,
        blockIdx.y * blockDim.y + threadIdx.y,
        blockIdx.z * blockDim.z + threadIdx.z
    );

    uint idxGlobal = (absoluteIdx.z * gridDim.x * blockDim.x * gridDim.y * blockDim.y) + (absoluteIdx.y * gridDim.x * blockDim.x) + absoluteIdx.x;
uint rank = info[0];
uint rank_2 = rank * 2;
uint l_0_0;
uint l_0_1;
uint l_0_2;
uint l_0_3;
bool l_0_4;
uint l_0_5;
uint l_0_6;
uint l_0_7;
__half2 l_0_8;
__half2 l_0_9;
l_0_0 = idxGlobal;
l_0_1 = idxGlobal;
l_0_2 = idxGlobal;
l_0_3 = info[(2 * 2 * info[0]) + 1] / 2;
l_0_4 = l_0_0 >= l_0_3;
if (l_0_4) {
return;}
l_0_3 = l_0_0 * uint(2);
l_0_5 = uint(0);

for (uint l_1_0 = uint(0); l_1_0 < rank; ++l_1_0) {
l_0_6 = info[(0 * rank_2) + l_1_0 + 1];
l_0_6 = l_0_3 / l_0_6;
l_0_7 = info[(1 * rank_2) + rank + l_1_0 + 1];
l_0_6 = l_0_6 % l_0_7;
l_0_7 = info[(1 * rank_2) + l_1_0 + 1];
l_0_6 = l_0_6 * l_0_7;
l_0_5 = l_0_5 + l_0_6;
}
l_0_5 = l_0_5 / uint(2);
l_0_2 = l_0_5;
uint l_0_10;
bool l_0_11;
l_0_10 = info[(2 * 2 * info[0]) + 1] / 2;
l_0_11 = l_0_1 < l_0_10;
if (l_0_11) {
l_0_8 = input_0[l_0_1];
} else {
l_0_8.i_0 = __half(0.0);
l_0_8.i_1 = __half(0.0);
}
uint l_0_12;
bool l_0_13;
l_0_12 = info[(2 * 2 * info[0]) + 2] / 2;
l_0_13 = l_0_2 < l_0_12;
if (l_0_13) {
l_0_9 = input_1[l_0_2];
} else {
l_0_9.i_0 = __half(0.0);
l_0_9.i_1 = __half(0.0);
}
l_0_8 = l_0_8 * l_0_9;
uint l_0_14;
bool l_0_15;
l_0_14 = info[(2 * 2 * info[0]) + 1] / 2;
l_0_15 = l_0_0 < l_0_14;
if (l_0_15) {
input_0[l_0_0] = l_0_8;
}

}
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@laggui laggui merged commit 877996b into main Sep 3, 2024
2 checks passed