Introduction of CUDA Graphs to llama.cpp #6766

Merged (20 commits, May 8, 2024)

Commits on Apr 19, 2024

  1. cec409a

Commits on Apr 22, 2024

  1. Fix issues raised in comments

    agray3 committed Apr 22, 2024 (c8dd0e7)
  2. 800f4fe
  3. c2691d9

Commits on Apr 23, 2024

  1. df4719e

Commits on Apr 24, 2024

  1. added missing CUDA_CHECKs

    agray3 committed Apr 24, 2024 (c3d4ead)
  2. Addressed comments

    agray3 committed Apr 24, 2024 (d403b18)
  3. further addressed comments

    agray3 committed Apr 24, 2024 (4087596)

Commits on Apr 25, 2024

  1. 0640427

Commits on Apr 29, 2024

  1. 9c57861

Commits on Apr 30, 2024

  1. d44e0fb
  2. eb9f15f
  3. Revert "With mechanism to fall back if graph capture fails"

    This reverts commit eb9f15f.
    agray3 committed Apr 30, 2024 (909e4c6)

Commits on May 1, 2024

  1. 5819950

Commits on May 2, 2024

  1. 44af096

Commits on May 7, 2024

  1. 4e1f2a0
  2. - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

    - rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS
    
    - updated Makefile build to enable CUDA graphs
    
    - removed graph capture failure checking in ggml_cuda_error
      using a global variable to track this is not thread safe, and I am also not satisfied with checking an error by its string;
      if this is necessary to work around some issues with graph capture with e.g. cuBLAS, we can pass the ggml_backend_cuda_context to the error-checking macro and store the result in the context
    
    - fixed several resource leaks
    
    - fixed issue with zero node graphs
    
    - changed fixed size arrays to vectors
    
    - removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed
    
    - removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row
    
    - changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX
    
    - code style fixes
    
    - things to look into
      - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
      - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes
    slaren committed May 7, 2024 (e830949)
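
    The commit message above describes switching stream capture to relaxed mode and (per the earlier reverted commit) falling back to eager kernel launches when capture fails. A minimal sketch of that general pattern follows; it is illustrative only, not the PR's actual code, and the `add_one` kernel and `run` helper are hypothetical names:

    ```cuda
    // Sketch: capture launches on a stream into a CUDA graph using relaxed
    // capture mode, instantiate and replay it, and fall back to plain
    // launches if capture is unavailable or fails.
    #include <cuda_runtime.h>

    __global__ void add_one(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    void run(float *d_x, int n, cudaStream_t stream) {
        cudaGraph_t     graph    = nullptr;
        cudaGraphExec_t instance = nullptr;

        // Relaxed mode does not forbid potentially unsafe API calls from
        // other threads during capture, unlike the global/thread-local modes.
        cudaError_t err = cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed);
        if (err == cudaSuccess) {
            add_one<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
            err = cudaStreamEndCapture(stream, &graph);
        }

        if (err == cudaSuccess && graph != nullptr) {
            // Replaying the instantiated graph avoids per-launch CPU overhead.
            cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);
            cudaGraphLaunch(instance, stream);
            cudaGraphExecDestroy(instance);
            cudaGraphDestroy(graph);
        } else {
            // Fallback path: launch the kernel eagerly.
            add_one<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
        }
        cudaStreamSynchronize(stream);
    }
    ```

    The fallback branch mirrors the commit's intent that graph capture be an optimization, never a correctness requirement: the same work runs either way.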

Commits on May 8, 2024

  1. fix build without cuda graphs

    slaren committed May 8, 2024 (a4c9b90)
  2. remove outdated comment

    slaren committed May 8, 2024 (ab40e66)
  3. f42312e