Using Accelerate for vector scale #193

philipturner · 2023-05-24T20:22:17Z

We could use Accelerate to scale the vector here, similarly to how add and exp use Accelerate.

Lines 3250 to 3277 in 2992df0

inline static void ggml_vec_scale_f32(const int n, float * y, const float v) {
#if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);

    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_MUL(ay[j], vx);

            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
        }
    }

    // leftovers
    for (int i = np; i < n; ++i) {
        y[i] *= v;
    }
#else
    // scalar
    for (int i = 0; i < n; ++i) {
        y[i] *= v;
    }
#endif
}
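
The `np` expression in the listing above rounds `n` down to a multiple of the step size, so the vector loop handles whole blocks and the scalar loop picks up the remainder. A minimal portable sketch of that pattern, assuming a hypothetical step of 8 (`STEP` and both function names are made up for illustration; the real `GGML_F32_STEP` depends on the target architecture) and a plain inner loop standing in for the SIMD macros:

```c
#include <assert.h>

/* Hypothetical step size for illustration; must be a power of two. */
#define STEP 8

/* Round n down to a multiple of STEP (valid only for power-of-two STEP). */
static int round_down_to_step(int n) {
    assert(n >= 0);
    return n & ~(STEP - 1);
}

/* Scale y[0..n) by v: blocked main loop plus scalar leftovers. */
static void scale_f32(int n, float *y, float v) {
    const int np = round_down_to_step(n);
    for (int i = 0; i < np; i += STEP) {
        for (int j = 0; j < STEP; j++) {   /* stands in for one SIMD step */
            y[i + j] *= v;
        }
    }
    for (int i = np; i < n; ++i) {         /* leftovers */
        y[i] *= v;
    }
}
```

For `n = 19` and a step of 8, the blocked loop covers the first 16 elements and the leftover loop the last 3. The bit-mask trick only works when the step is a power of two.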

https://developer.apple.com/documentation/accelerate/1450020-vdsp_vsmul


jaeminSon · 2023-05-25T23:31:36Z

I naively thought that adding an "#if defined" branch at the top and passing correctly typed arguments to 'vDSP_vsmul' would solve this easily. But when I modify the code as follows, I get a segmentation fault. What do you think the problem is?

inline static void ggml_vec_scale_f32(const int n, float * y, const float v) {
#if defined(GGML_USE_ACCELERATE)
    vDSP_vsmul(y, 1, y, (float*) &v, 1, n);
#elif defined(GGML_SIMD)
    // ... code below unchanged

philipturner · 2023-05-26T05:57:08Z

Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?

Btw, I think you could vastly improve the softmax part by writing vectorized code that fuses the kernel calls. Calling into Accelerate this way makes it memory bound, with most of the time spent reading everything from L1 and writing it back.

philipturner · 2023-05-26T06:11:00Z

I don’t know why, but llama.cpp is much slower than it should theoretically be. Going by @ggerganov’s CPU bandwidth (200 GB/s), the CPU cores should eat the entire 6.7B-q4.5 model in 16 ms. But for some reason the token latency is 43 ms.

That’s a 2-3x speed up we could have by redesigning the code, not just an incremental speed up.

ggerganov · 2023-05-26T06:17:53Z

> Going by @ggerganov’s CPU bandwidth (200 GB/s)

This number is Apple's claim for the memory bandwidth of M1 Pro if I remember correctly.
I haven't been able to reproduce this speed. The best I've seen is ~80-90 GB/s: ggerganov/llama.cpp#34 (comment)

And for a single thread, it's no more than 40 GB/s.
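
The disagreement here comes down to simple arithmetic: if every weight is read once per token, token latency is bounded below by model size divided by memory bandwidth. A rough sketch of that estimate, assuming 4.5 bits per weight for the q4.5 model and treating the bandwidth figures quoted in this thread as inputs (the function name is hypothetical, and real kernels also read activations and the KV cache, which this ignores):

```c
#include <assert.h>

/* Lower bound on per-token latency in milliseconds if every weight is read
   exactly once per token. All inputs are assumptions, not measurements. */
static double min_token_latency_ms(double n_params, double bits_per_weight,
                                   double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    const double bytes = n_params * bits_per_weight / 8.0;  /* model size */
    return bytes / (bandwidth_gb_per_s * 1e9) * 1e3;        /* s -> ms */
}
```

Under these assumptions, 6.7e9 parameters give about 19 ms per token at 200 GB/s, about 44 ms at 85 GB/s, and about 94 ms at 40 GB/s. The result is sensitive to the assumed bytes per weight, so treat the numbers as ballpark figures only.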

philipturner · 2023-05-26T06:31:27Z

A single thread has reached 100 GB/s in some benchmarks. When it’s occupied with other work or the code is improperly written, it can’t utilize all of that. But then there are 8 cores total to harness that bandwidth.

On GPU (M1 Max), I have achieved 378 GB/s out of 400 GB/s in a custom Metal blit command. It requires careful tuning: aligning the data structure to 64B boundaries. From what I can tell, llama.cpp's data is not aligned.

https://github.com/philipturner/metal-usm/blob/23e9f324fd3e4ecb1078cf6b211bd25753de718b/BlitEncoderAlternative/MainFile.swift#L31-L60

I even went so far as to shuffle data around in threadgroup memory, just so whatever it eats and spits out is 64B aligned:

https://github.com/philipturner/metal-usm/blob/23e9f324fd3e4ecb1078cf6b211bd25753de718b/BlitEncoderAlternative/Kernels.metal#L177-L200
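
For the CPU side, one way to get buffers on 64-byte boundaries in C is C11's `aligned_alloc`. This is only a sketch of the alignment idea mentioned above, not what llama.cpp or the linked Metal code does; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* bytes; the boundary discussed above */

/* Round size up to a multiple of ALIGNMENT, as aligned_alloc expects. */
static size_t round_up_to_alignment(size_t size) {
    return (size + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

/* Allocate room for n floats on a 64-byte boundary.
   Returns NULL on failure; the caller releases the buffer with free(). */
static float *alloc_aligned_f32(size_t n) {
    assert(n > 0);
    return (float *)aligned_alloc(ALIGNMENT,
                                  round_up_to_alignment(n * sizeof(float)));
}

/* Check whether a pointer sits on an ALIGNMENT-byte boundary. */
static int is_aligned(const void *p) {
    return ((uintptr_t)p % ALIGNMENT) == 0;
}
```

Rounding the size up matters because `aligned_alloc` is specified for sizes that are a multiple of the alignment.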

jaeminSon · 2023-05-26T11:24:39Z

> Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?

My mistake! It should be

vDSP_vsmul(y, 1, (float*) &v, y, 1, n);

No segmentation fault anymore!
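
The argument order explains the crash. `vDSP_vsmul` takes the input vector and its stride, then a pointer to the scalar, then the output vector and its stride, then the length. In the first attempt the scalar pointer and the output pointer were swapped, so the call would have written `n` floats starting at the address of the single float `v`, which is the likely cause of the fault. A portable stand-in that mirrors this parameter order (`ref_vsmul` and `scale_in_place` are hypothetical names, not part of Accelerate):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in mirroring vDSP_vsmul's parameter order:
   C[i*IC] = A[i*IA] * (*B) for i in [0, N). */
static void ref_vsmul(const float *A, ptrdiff_t IA, const float *B,
                      float *C, ptrdiff_t IC, size_t N) {
    assert(A && B && C);
    const float scalar = *B;  /* B points at one float, not at a vector */
    for (size_t i = 0; i < N; ++i) {
        C[(ptrdiff_t)i * IC] = A[(ptrdiff_t)i * IA] * scalar;
    }
}

/* In-place scale, matching the corrected call in the thread. */
static void scale_in_place(int n, float *y, float v) {
    ref_vsmul(y, 1, &v, y, 1, (size_t)n);
}
```

The scalar parameter is a `const float *` here, as in Apple's documented declaration, so `&v` can be passed without a cast.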

jaeminSon · 2023-05-26T11:58:57Z

I ran it several times, but the SIMD version tends to be faster.

hardware: MacBook Pro (Retina, 13-inch, Early 2015), 2.7 GHz dual core Intel Core i5, 8GB 1867 MHz DDR3, Intel Iris Graphics 6100 1536 MB
os: macOS Monterey (v12.6)
gpt-model: Cerebras-GPT-111M

Output with GGML_SIMD:

this is a tokenization test, but the user is getting the response.
I'm trying to test for a method called response:
         if (user == null)
             return response.getJSON().text()

        response.getJSON().text()

        response.getJSON().text()

This is the code that works. The exception is the method that I use to get the response.
I would appreciate any help, in any event.

A:

You're getting the response.
This is the tokenization test

It's probably the first time you've used it, but you're not exactly sure how to do it.
There are many things you can do to improve the way you are able to work with this code. It's the only way you can change a tokenization test for the model,

main: mem per token =  1712332 bytes
main:     load time =   715.21 ms
main:   sample time =    59.12 ms
main:  predict time =  9944.77 ms / 48.51 ms per token
main:    total time = 12506.17 ms

Output using vDSP_vsmul:

main: prompt: 'this is a tokenization test'
main: number of tokens in prompt = 6, first 8 tokens: 5661 318 257 11241 1634 1332 

this is a tokenization test with this method and this method has a user-defined tokenizer.
                                                                                                                                                                                         

main: mem per token =  1712332 bytes
main:     load time =   842.32 ms
main:   sample time =    61.45 ms
main:  predict time = 11836.78 ms / 57.74 ms per token
main:    total time = 15593.00 ms

philipturner · 2023-05-26T12:04:51Z

Try replacing some other Accelerate calls with vectorized code. Bonus if you can fuse two elementwise operations of the softmax without writing the element back to memory in between.

  // NOTE: Softmax is expected to consume the most time, due to the latency of
  // each function call and inability to keep the elements in registers.
  // Consider writing vectorized Swift code for a fairer comparison to GPU.
  
  // Pseudocode for softmax operation:
  // (1) find maximum element in each row
  // (2) subtract the maximum from all elements
  // (3) apply the exponential operator to all elements
  // (4) find the sum of each row
  // (5) divide all elements by the sum
  for i in 0..<UInt(NQ) {
    // The elements to operate on.
    let n = UInt(NKV)
    let row = _QK + Int(i * n)
    
    // (1)
    var maxValue: Float = 0
    vDSP_maxv(row, 1, &maxValue, n)
    assert(maxValue != 0)
    
    // (2)
    maxValue = -maxValue
    vDSP_vsadd(row, 1, &maxValue, row, 1, n)
    
    // (3)
    vvexpf(row, row, &NKV)
    
    // (4)
    var sumValue: Float = 0
    vDSP_sve(row, 1, &sumValue, n)
    
    // (5)
    sumValue = simd_precise_recip(sumValue)
    vDSP_vsmul(row, 1, &sumValue, row, 1, n)
  }

Becomes

  for i in 0..<UInt(NQ) {
    // The elements to operate on.
    let n = UInt(NKV)
    let row = _QK + Int(i * n)
    
    // (1)
    var maxValue: Float = 0
    vDSP_maxv(row, 1, &maxValue, n)
    assert(maxValue != 0)

    // PSEUDOCODE STARTS
    typealias Vector = SIMD16<Float> // Try multiple vector lengths.
    var sumValueVec: Vector = .zero
    for i in 0..<n / Vector.elementCount { // TODO: Handle the last iteration carefully.
       let i_amp = i * Vector.elementCount
       let pointer = (row + i_amp).reinterpret_cast(Vector.self)

       // (2)
       // (3)
       let value = exp(pointer.pointee - maxValue)
       pointer.pointee = value

       // (4)
       sumValueVec += value
    }
    var sumValue: Float = sumValueVec.sum()
    // PSEUDOCODE ENDS
    
    // (5)
    sumValue = simd_precise_recip(sumValue)
    vDSP_vsmul(row, 1, &sumValue, row, 1, n)
  }

nullhook · 2023-07-05T22:25:15Z

> Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?
>
> My mistake! It should be
> vDSP_vsmul(y, 1, (float*) &v, y, 1, n);
> No segmentation fault anymore!

Why are you casting? It seems redundant.

@Const-me

* Add AVX2 version of ggml_vec_dot_q4_1 * Small optimisations to q4_1 dot product (@Const-me) * Rearrange Q4_1 quantization to work for multipart models. (Fix ggerganov#152) * Fix ggml_vec_mad_q4_1 too * Fix non-vectorised q4_1 vec mul

ggerganov added the enhancement and good first issue labels May 25, 2023

nullhook mentioned this issue Jul 13, 2023

Added vector scaling using Accelerate #380

Merged

philipturner closed this as completed Jul 14, 2023

philipturner reopened this Jul 14, 2023

ggerganov closed this as completed in #380 Jul 23, 2023
