
Clarification on API Endpoint: /v1/completions vs /v1/chat/completions #1637

Open
gerayking opened this issue Mar 26, 2024 · 10 comments

@gerayking

Code in gguf.py:

response = requests.post(
                    f"{self.base_url}/v1/completions", json=request
                )

I've been exploring the publicly available OpenAPI documentation and came across a point of confusion regarding the API endpoints. Specifically, I have two questions that I hope can be clarified:

In the OpenAPI documentation, I could not find an endpoint named {url}/v1/completions. I was wondering if this is an oversight in the documentation or if the endpoint is not supported.

Could you please confirm if the correct endpoint for obtaining completions is /v1/chat/completions? I'm asking because this seems to be the relevant endpoint for what I'm trying to achieve, but I want to ensure I'm using the API correctly.
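For reference, my understanding of the difference between the two request shapes is roughly this (a minimal sketch, assuming an OpenAI-compatible server running locally; the URL, prompt, and parameters are just placeholders):

import requests

base_url = "http://localhost:8080"  # placeholder local server

# /v1/completions: plain text completion, takes a raw "prompt" string
completion_request = {
    "prompt": "Question: What is the capital of France?\nAnswer:",
    "max_tokens": 16,
    "temperature": 0.0,
    "logprobs": 10,
}
resp = requests.post(f"{base_url}/v1/completions", json=completion_request)

# /v1/chat/completions: chat-style, takes a list of role/content "messages"
chat_request = {
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 16,
    "temperature": 0.0,
}
resp = requests.post(f"{base_url}/v1/chat/completions", json=chat_request)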

Any clarification or additional information you can provide would be greatly appreciated.

@haileyschoelkopf
Contributor

Hi! Could you share a link to the documentation you are referencing?

This GGUF integration was designed around base LMs (i.e., ones not using a chat template), so v1/completions was at least the right choice at the time of integration. Regardless of the solution, we should probably document this better for users, and potentially support both options.
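To make "support both options" concrete, a rough sketch (purely hypothetical, not the current implementation) could switch the endpoint and payload shape on a user-supplied flag:

# Hypothetical sketch: pick the endpoint and payload based on whether the
# target model expects chat-formatted input.
def build_request(prompt: str, use_chat_endpoint: bool = False):
    if use_chat_endpoint:
        # chat models: wrap the prompt in a messages list
        return "/v1/chat/completions", {
            "messages": [{"role": "user", "content": prompt}]
        }
    # base models: send the raw prompt string
    return "/v1/completions", {"prompt": prompt}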

@gerayking
Author

gerayking commented Mar 27, 2024

The documentation I'm referencing is the llama.cpp server (llama_server).
And I tried to rewrite gguf.py as follows:

import logging
import time

import requests
from requests.exceptions import RequestException
from tqdm import tqdm

from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


logger = logging.getLogger(__name__)


def get_result(logprobs, context_length):
    is_greedy = True
    continuation_logprobs = 0
    idx = context_length + 1

    for i in range(idx, len(logprobs)):
        current_probs = logprobs[i]["probs"]
        token = logprobs[i]["content"]

        continuation_logprobs += current_probs[0]["prob"]

        # find the highest-probability candidate token at this position
        top_token = ""
        top_rate = -1
        for t in current_probs:
            if t["prob"] > top_rate:
                top_rate = t["prob"]
                top_token = t["tok_str"]

        # check whether the generated token was the greedy choice
        if top_token != token:
            is_greedy = False
            break

    return continuation_logprobs, is_greedy



@register_model("gguf", "ggml")
class GGUFLM(LM):
    def __init__(self, base_url=None, max_length=2048, **kwargs):
        super().__init__()
        self.base_url = base_url
        assert self.base_url, "must pass `base_url` to use GGUF LM!"
        self.logprobs = 10
        self.temperature = 0.0
        self.max_length = max_length

    def gguf_completion(
        self, context, continuation=None, stop=None, retries=3, delay=5, **kwargs
    ):
        print(context)
        for _ in range(retries):
            try:
                prompt = context
                request = {
                    "prompt": prompt,
                    "n_probs": self.logprobs,
                    "temperature": self.temperature,
                    "n_predict": self.max_length
                }
                if continuation:
                    prompt += continuation
                    request.update({"prompt": prompt, "max_tokens": 1, "echo": True})
                if stop is not None:
                    request["stop"] = stop
                response = requests.post(
                    f"{self.base_url}/v1/completions", json=request
                )
                response.raise_for_status()
                return response.json()
            except RequestException as e:
                logger.error(f"RequestException: {e}")
                time.sleep(delay)  # wait before retrying
        else:
            raise Exception(f"Failed to get a valid response after {retries} retries.")

    def loglikelihood(self, requests, disable_tqdm: bool = False):
        if not requests:
            return []
        res = []
        for context, continuation in tqdm(
            [req.args for req in requests], disable=disable_tqdm
        ):
            response = self.gguf_completion(context=context, continuation=continuation)
            if response and "content" in response and response["content"]:
                choice = response
                logprobs = choice["completion_probabilities"]
                if logprobs:
                    logprob, is_greedy = get_result(logprobs, len(context))
                    res.append((logprob, is_greedy))
                else:
                    logger.warning(
                        "Invalid logprobs data: expected a non-empty 'completion_probabilities' list."
                    )
            else:
                logger.error(
                    f"Invalid response for loglikelihood. Response: {response}"
                )
                assert False
        return res

    def generate_until(self, requests, disable_tqdm: bool = False):
        if not requests:
            return []

        res = []
        for request in tqdm([req.args for req in requests], disable=disable_tqdm):
            inp = request[0]
            request_args = request[1]
            until = request_args.get("until", ["</s>"])
            response = self.gguf_completion(context=inp, stop=until)
            if response and "content" in response and response["content"]:
                choice = response
                if "content" in choice:
                    generated_text = choice["content"].strip()
                    res.append(generated_text)
                else:
                    logger.error(
                        f"Invalid response for greedy_until. Response: {response}"
                    )
                    res.append(None)  # Add default value in case of error
            else:
                logger.error(f"Invalid response for greedy_until. Response: {response}")
                res.append(None)  # Add default value in case of error
        return res

    def loglikelihood_rolling(self, requests, disable_tqdm: bool = False):
        raise NotImplementedError(
            "loglikelihood_rolling not yet supported for GGUF models"
        )

But the loglikelihood may be wrong; I ran the code below and got an accuracy of only 0.49.

import json
from lm_eval.models.gguf import GGUFLM
from lm_eval import simple_evaluate

lm = GGUFLM(base_url="http://localhost:8080")
# tasks.initialize_tasks()
results = simple_evaluate(model=lm,tasks=["piqa"]) 
# export the data in results to a JSON file (excluding samples)
filtered_results = results.copy()  
filtered_results = {key: value for key, value in results.items() if key != "samples"}  
json_filtered_results = json.dumps(filtered_results, indent=4)  
with open("results.json", "w") as json_file:
    json_file.write(json_filtered_results)

result:

{
    "results": {
        "piqa": {
            "acc,none": 0.4923830250272035,
            "acc_stderr,none": 0.011664470424044978,
            "acc_norm,none": 0.4923830250272035,
            "acc_norm_stderr,none": 0.011664470424044978,
            "alias": "piqa"
        }
    },
    "group_subtasks": {
        "piqa": []
    },
    "configs": {
        "piqa": {
            "task": "piqa",
            "dataset_path": "piqa",
            "training_split": "train",
            "validation_split": "validation",
            "doc_to_text": "Question: {{goal}}\nAnswer:",
            "doc_to_target": "label",
            "doc_to_choice": "{{[sol1, sol2]}}",
            "description": "",
            "target_delimiter": " ",
            "fewshot_delimiter": "\n\n",
            "num_fewshot": 0,
            "metric_list": [
                {
                    "metric": "acc",
                    "aggregation": "mean",
                    "higher_is_better": true
                },
                {
                    "metric": "acc_norm",
                    "aggregation": "mean",
                    "higher_is_better": true
                }
            ],
            "output_type": "multiple_choice",
            "repeats": 1,
            "should_decontaminate": true,
            "doc_to_decontamination_query": "goal",
            "metadata": {
                "version": 1.0
            }
        }
    },
    "versions": {
        "piqa": 1.0
    },
    "n-shot": {
        "piqa": 0
    },
    "config": {
        "model": "GGUFLM",
        "model_args": null,
        "batch_size": null,
        "batch_sizes": [],
        "device": null,
        "use_cache": null,
        "limit": null,
        "bootstrap_iters": 100000,
        "gen_kwargs": null
    },
    "git_hash": "4600d6bf",
    "date": 1711460633.7369168,
    "pretty_env_info": "PyTorch version: 2.2.1\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 20.04.6 LTS (aarch64)\nGCC version: (GCC) 11.2.0\nClang version: Could not collect\nCMake version: version 3.16.3\nLibc version: glibc-2.31\n\nPython version: 3.10.14 (main, Mar 21 2024, 16:18:23) [GCC 11.2.0] (64-bit runtime)\nPython platform: Linux-5.4.0-153-generic-aarch64-with-glibc2.31\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture:                    aarch64\nCPU op-mode(s):                  32-bit, 64-bit\nByte Order:                      Little Endian\nCPU(s):                          8\nOn-line CPU(s) list:             0-7\nThread(s) per core:              1\nCore(s) per socket:              8\nSocket(s):                       1\nNUMA node(s):                    1\nVendor ID:                       ARM\nModel:                           0\nStepping:                        r0p0\nCPU max MHz:                     3000.0000\nCPU min MHz:                     3000.0000\nBogoMIPS:                        100.00\nL1d cache:                       512 KiB\nL1i cache:                       512 KiB\nL2 cache:                        8 MiB\nL3 cache:                        64 MiB\nNUMA node0 CPU(s):               0-7\nVulnerability Itlb multihit:     Not affected\nVulnerability L1tf:              Not affected\nVulnerability Mds:               Not affected\nVulnerability Meltdown:          Not affected\nVulnerability Mmio stale data:   Not affected\nVulnerability Retbleed:          Not affected\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1:        Mitigation; __user pointer sanitization\nVulnerability Spectre v2:        Mitigation; CSV2, BHB\nVulnerability Srbds:             Not affected\nVulnerability Tsx async abort:   Not affected\nFlags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] torch==2.2.1\n[conda] numpy                     1.26.4                   pypi_0    pypi\n[conda] torch                     2.2.1                    pypi_0    pypi",
    "transformers_version": "4.39.1",
    "upper_git_hash": null
}

@L1-M1ng

L1-M1ng commented Apr 7, 2024

Hi, I've run into the same problem. Have you solved it?

@haileyschoelkopf
Contributor

What model is being evaluated here? I will try to look into this soon.

@gerayking
Author

What model is being evaluated here? I will try to look into this soon.

Qwen-1.8B.
I discovered that the original gguf.py script works when the server is started with llama-cpp-python instead. However, the accuracy (acc,none) is only 0.68, which might indicate an error. Could you test it to confirm?

@gerayking
Author

Hi, I've run into the same problem. Have you solved it?

Try using llama-cpp-python instead of llama.cpp to start the server.
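For example, something along these lines should start an OpenAI-compatible server (the exact flags may differ between llama-cpp-python versions, so treat this as a rough pointer):

pip install "llama-cpp-python[server]"
python -m llama_cpp.server --model /path/to/model.gguf --port 8080

Then point base_url at http://localhost:8080.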

@L1-M1ng

L1-M1ng commented Apr 8, 2024

Hi, I've run into the same problem. Have you solved it?

Try using llama-cpp-python instead of llama.cpp to start the server.

Thanks for your reply, I'll try it later.

@L1-M1ng

L1-M1ng commented Apr 8, 2024

What model is being evaluated here? I will try to look into this soon.

Qwen-1.8B. I discovered that the original gguf.py script works when the server is started with llama-cpp-python instead. However, the accuracy (acc,none) is only 0.68, which might indicate an error. Could you test it to confirm?

I found that the way log_logits_sum is computed differs between gguf.py and huggingface.py.
In huggingface.py, the log-probs are gathered at the corresponding continuation token indices:
[screenshot of the huggingface.py loglikelihood code]
But in gguf.py, it just sums the probabilities of the model's output tokens and does not select the probabilities at the continuation token indices:
[screenshot of the gguf.py log-prob computation]
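In pseudocode, the huggingface.py-style indexing I mean is roughly this (a simplified sketch, not the actual implementation; the function name and arguments are just illustrative):

import torch.nn.functional as F

def continuation_logprob(logits, continuation_token_ids, context_len):
    # logits: [seq_len, vocab_size] tensor over the concatenated context + continuation
    logprobs = F.log_softmax(logits, dim=-1)
    total, is_greedy = 0.0, True
    for i, tok in enumerate(continuation_token_ids):
        pos = context_len - 1 + i  # logits at position pos predict the token at pos + 1
        total += logprobs[pos, tok].item()  # log-prob of the target continuation token
        if logprobs[pos].argmax().item() != tok:  # greedy check against the argmax token
            is_greedy = False
    return total, is_greedy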
Am I right to understand it this way? @haileyschoelkopf

@haileyschoelkopf
Contributor

In GGUF it should determine based on the offsets which tokens are part of the continuation and which are not (the while loop in the screenshot skips any context tokens).

I've definitely run this and gotten equivalent performance on a Llama-7B model compared against HF at the time... what is Qwen-1.8B's piqa accuracy in the huggingface implementation?
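Roughly, the offset-based logic I mean looks like this (a paraphrased sketch assuming OpenAI-style logprobs fields, not the exact code; see gguf.py for the real implementation):

def get_result(logprobs, context_length):
    # logprobs: OpenAI-style dict with "text_offset", "tokens",
    # "token_logprobs" and "top_logprobs" lists
    is_greedy = True
    offsets = logprobs["text_offset"]
    tokens = logprobs["tokens"]
    token_logprobs = logprobs["token_logprobs"]

    # skip tokens that belong to the context: advance until the character
    # offset reaches the end of the context string
    idx = 0
    while idx < len(offsets) and offsets[idx] < context_length:
        idx += 1

    # sum log-probs only over the continuation tokens
    continuation_logprobs = sum(token_logprobs[idx:])

    # greedy check: the observed token must be the top candidate at every position
    for i in range(idx, len(tokens)):
        top = logprobs["top_logprobs"][i]
        if max(top, key=top.get) != tokens[i]:
            is_greedy = False
            break

    return continuation_logprobs, is_greedy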

@gerayking
Author

In GGUF it should determine based on the offsets which tokens are part of the continuation and which are not (the while loop in the screenshot skips any context tokens).

I've definitely run this and gotten equivalent performance on a Llama-7B model compared against HF at the time... what is Qwen-1.8B's piqa accuracy in the huggingface implementation?

Thank you for your assistance. I've tested Qwen-1.8B's accuracy on PIQA and found it to be 0.68, but I'm not sure whether that is correct.
