
[Feature request] Implement 8-bit GPT-J #5

Closed
pablogranolabar opened this issue Nov 13, 2022 · 0 comments · Fixed by #27
Labels: enhancement (New feature or request)

Comments

@pablogranolabar

Results in ~11 GB of weights vs. 16 GB; now implemented in PyTorch (Transformers) as load_in_8bit=True:

https://huggingface.co/hivemind/gpt-j-6B-8bit
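For reference, a minimal sketch of what the PyTorch-side loading looks like, assuming Transformers with bitsandbytes installed and a CUDA GPU; the model id `EleutherAI/gpt-j-6B` and the generation call are illustrative, not taken from the linked repo:

```python
# Sketch: load GPT-J with int8 weights via Transformers + bitsandbytes.
# load_in_8bit=True quantizes the fp16/fp32 checkpoint to 8-bit at load time,
# which is what brings the in-memory weight size down (~11 GB vs. ~16 GB above).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # assumed standard checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # place layers on the available GPU(s)
    load_in_8bit=True,   # int8 weights via bitsandbytes
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```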

@ggerganov added the enhancement (New feature or request) label on Nov 14, 2022

CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue on Dec 18, 2023:
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (ggerganov#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP

---------

Co-authored-by: YellowRoseCx <[email protected]>
Co-authored-by: ardfork <[email protected]>
Co-authored-by: funnbot <[email protected]>
Co-authored-by: Engininja2 <[email protected]>
Co-authored-by: Kerfuffle <[email protected]>
Co-authored-by: jammm <[email protected]>
Co-authored-by: jdecourval <[email protected]>