
Clarification on API Endpoint: /v1/completions vs /v1/chat/completions #1637

Open
gerayking opened this issue Mar 26, 2024 · 10 comments

@gerayking

Code in gguf.py:

response = requests.post(
                    f"{self.base_url}/v1/completions", json=request
                )

I've been exploring the publicly available OpenAPI documentation and came across a point of confusion regarding the API endpoints. Specifically, I have two questions that I hope can be clarified:

In the OpenAPI documentation, I could not find an endpoint named {url}/v1/completions. I was wondering if this is an oversight in the documentation or if the endpoint is not supported.

Could you please confirm if the correct endpoint for obtaining completions is /v1/chat/completions? I'm asking because this seems to be the relevant endpoint for what I'm trying to achieve, but I want to ensure I'm using the API correctly.
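For reference, my understanding of the difference between the two request shapes is roughly this (a minimal sketch, assuming an OpenAI-compatible server running locally; the URL, prompt, and parameters are just placeholders):

import requests

base_url = "http://localhost:8080"  # placeholder local server

# /v1/completions: plain text completion, takes a raw "prompt" string
completion_request = {
    "prompt": "Question: What is the capital of France?\nAnswer:",
    "max_tokens": 16,
    "temperature": 0.0,
    "logprobs": 10,
}
resp = requests.post(f"{base_url}/v1/completions", json=completion_request)

# /v1/chat/completions: chat-style, takes a list of role/content "messages"
chat_request = {
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 16,
    "temperature": 0.0,
}
resp = requests.post(f"{base_url}/v1/chat/completions", json=chat_request)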

Any clarification or additional information you can provide would be greatly appreciated.

@haileyschoelkopf
Contributor

Hi! Could you share a link to the documentation you are referencing?

This GGUF integration was designed around base LMs (i.e., ones not using a chat template), so v1/completions was at least the right choice at the time of integration. Regardless of the solution, we should probably document this better for users, and potentially support both options.
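To make "support both options" concrete, a rough sketch (purely hypothetical, not the current implementation) could switch the endpoint and payload shape on a user-supplied flag:

# Hypothetical sketch: pick the endpoint and payload based on whether the
# target model expects chat-formatted input.
def build_request(prompt: str, use_chat_endpoint: bool = False):
    if use_chat_endpoint:
        # chat models: wrap the prompt in a messages list
        return "/v1/chat/completions", {
            "messages": [{"role": "user", "content": prompt}]
        }
    # base models: send the raw prompt string
    return "/v1/completions", {"prompt": prompt}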

@gerayking
Author

gerayking commented Mar 27, 2024

The documentation I'm referencing is the llama.cpp server (llama_server).
And I tried to rewrite gguf.py as follows:

import logging
import time

import requests
from requests.exceptions import RequestException
from tqdm import tqdm

from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


logger = logging.getLogger(__name__)


def get_result(logprobs, context_length):
    is_greedy = True
    continuation_logprobs = 0
    idx = context_length + 1

    for i in range(idx, len(logprobs)):
        current_probs = logprobs[i]["probs"]
        token = logprobs[i]["content"]

        continuation_logprobs += current_probs[0]["prob"]

        # find the highest-probability candidate token at this position
        top_token = ""
        top_rate = -1
        for t in current_probs:
            if t["prob"] > top_rate:
                top_rate = t["prob"]
                top_token = t["tok_str"]

        # check whether the generated token was the greedy choice
        if top_token != token:
            is_greedy = False
            break

    return continuation_logprobs, is_greedy



@register_model("gguf", "ggml")
class GGUFLM(LM):
    def __init__(self, base_url=None, max_length=2048, **kwargs):
        super().__init__()
        self.base_url = base_url
        assert self.base_url, "must pass `base_url` to use GGUF LM!"
        self.logprobs = 10
        self.temperature = 0.0
        self.max_length = max_length

    def gguf_completion(
        self, context, continuation=None, stop=None, retries=3, delay=5, **kwargs
    ):
        print(context)
        for _ in range(retries):
            try:
                prompt = context
                request = {
                    "prompt": prompt,
                    "n_probs": self.logprobs,
                    "temperature": self.temperature,
                    "n_predict": self.max_length
                }
                if continuation:
                    prompt += continuation
                    request.update({"prompt": prompt, "max_tokens": 1, "echo": True})
                if stop is not None:
                    request["stop"] = stop
                response = requests.post(
                    f"{self.base_url}/v1/completions", json=request
                )
                response.raise_for_status()
                return response.json()
            except RequestException as e:
                logger.error(f"RequestException: {e}")
                time.sleep(delay)  # wait before retrying
        else:
            raise Exception(f"Failed to get a valid response after {retries} retries.")

    def loglikelihood(self, requests, disable_tqdm: bool = False):
        if not requests:
            return []
        res = []
        for context, continuation in tqdm(
            [req.args for req in requests], disable=disable_tqdm
        ):
            response = self.gguf_completion(context=context, continuation=continuation)
            if response and "content" in response and response["content"]:
                choice = response
                logprobs = choice["completion_probabilities"]
                if logprobs:
                    logprob, is_greedy = get_result(logprobs, len(context))
                    res.append((logprob, is_greedy))
                else:
                    logger.warning(
                        "Invalid logprobs data: expected a non-empty 'completion_probabilities' list."
                    )
            else:
                logger.error(
                    f"Invalid response for loglikelihood. Response: {response}"
                )
                assert False
        return res

    def generate_until(self, requests, disable_tqdm: bool = False):
        if not requests:
            return []

        res = []
        for request in tqdm([req.args for req in requests], disable=disable_tqdm):
            inp = request[0]
            request_args = request[1]
            until = request_args.get("until", ["</s>"])
            response = self.gguf_completion(context=inp, stop=until)
            if response and "content" in response and response["content"]:
                choice = response
                if "content" in choice:
                    generated_text = choice["content"].strip()
                    res.append(generated_text)
                else:
                    logger.error(
                        f"Invalid response for greedy_until. Response: {response}"
                    )
                    res.append(None)  # Add default value in case of error
            else:
                logger.error(f"Invalid response for greedy_until. Response: {response}")
                res.append(None)  # Add default value in case of error
        return res

    def loglikelihood_rolling(self, requests, disable_tqdm: bool = False):
        raise NotImplementedError(
            "loglikelihood_rolling not yet supported for GGUF models"
        )

But the loglikelihood may be wrong; I ran the code below and got an accuracy of only 0.49.

import json
from lm_eval.models.gguf import GGUFLM
from lm_eval import simple_evaluate

lm = GGUFLM(base_url="http://localhost:8080")
# tasks.initialize_tasks()
results = simple_evaluate(model=lm,tasks=["piqa"]) 
# export the data in results to a JSON file (excluding samples)
filtered_results = results.copy()  
filtered_results = {key: value for key, value in results.items() if key != "samples"}  
json_filtered_results = json.dumps(filtered_results, indent=4)  
with open("results.json", "w") as json_file:
    json_file.write(json_filtered_results)

result:

{
    "results": {
        "piqa": {
            "acc,none": 0.4923830250272035,
            "acc_stderr,none": 0.011664470424044978,
            "acc_norm,none": 0.4923830250272035,
            "acc_norm_stderr,none": 0.011664470424044978,
            "alias": "piqa"
        }
    },
    "group_subtasks": {
        "piqa": []
    },
    "configs": {
        "piqa": {
            "task": "piqa",
            "dataset_path": "piqa",
            "training_split": "train",
            "validation_split": "validation",
            "doc_to_text": "Question: {{goal}}\nAnswer:",
            "doc_to_target": "label",
            "doc_to_choice": "{{[sol1, sol2]}}",
            "description": "",
            "target_delimiter": " ",
            "fewshot_delimiter": "\n\n",
            "num_fewshot": 0,
            "metric_list": [
                {
                    "metric": "acc",
                    "aggregation": "mean",
                    "higher_is_better": true
                },
                {
                    "metric": "acc_norm",
                    "aggregation": "mean",
                    "higher_is_better": true
                }
            ],
            "output_type": "multiple_choice",
            "repeats": 1,
            "should_decontaminate": true,
            "doc_to_decontamination_query": "goal",
            "metadata": {
                "version": 1.0
            }
        }
    },
    "versions": {
        "piqa": 1.0
    },
    "n-shot": {
        "piqa": 0
    },
    "config": {
        "model": "GGUFLM",
        "model_args": null,
        "batch_size": null,
        "batch_sizes": [],
        "device": null,
        "use_cache": null,
        "limit": null,
        "bootstrap_iters": 100000,
        "gen_kwargs": null
    },
    "git_hash": "4600d6bf",
    "date": 1711460633.7369168,
    "pretty_env_info": "PyTorch version: 2.2.1\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 20.04.6 LTS (aarch64)\nGCC version: (GCC) 11.2.0\nClang version: Could not collect\nCMake version: version 3.16.3\nLibc version: glibc-2.31\n\nPython version: 3.10.14 (main, Mar 21 2024, 16:18:23) [GCC 11.2.0] (64-bit runtime)\nPython platform: Linux-5.4.0-153-generic-aarch64-with-glibc2.31\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture:                    aarch64\nCPU op-mode(s):                  32-bit, 64-bit\nByte Order:                      Little Endian\nCPU(s):                          8\nOn-line CPU(s) list:             0-7\nThread(s) per core:              1\nCore(s) per socket:              8\nSocket(s):                       1\nNUMA node(s):                    1\nVendor ID:                       ARM\nModel:                           0\nStepping:                        r0p0\nCPU max MHz:                     3000.0000\nCPU min MHz:                     3000.0000\nBogoMIPS:                        100.00\nL1d cache:                       512 KiB\nL1i cache:                       512 KiB\nL2 cache:                        8 MiB\nL3 cache:                        64 MiB\nNUMA node0 CPU(s):               0-7\nVulnerability Itlb multihit:     Not affected\nVulnerability L1tf:              Not affected\nVulnerability Mds:               Not affected\nVulnerability Meltdown:          Not affected\nVulnerability Mmio stale data:   Not affected\nVulnerability Retbleed:          Not affected\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1:        Mitigation; __user pointer sanitization\nVulnerability Spectre v2:        Mitigation; CSV2, BHB\nVulnerability Srbds:             Not affected\nVulnerability Tsx async abort:   Not affected\nFlags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] torch==2.2.1\n[conda] numpy                     1.26.4                   pypi_0    pypi\n[conda] torch                     2.2.1                    pypi_0    pypi",
    "transformers_version": "4.39.1",
    "upper_git_hash": null
}

@L1-M1ng

L1-M1ng commented Apr 7, 2024

Hi, I've run into the same problem. Have you solved it?

@haileyschoelkopf
Contributor

What model is being evaluated here? I will try to look into this soon.

@gerayking
Author

What model is being evaluated here? I will try to look into this soon.

Qwen-1.8B.
I discovered that the original gguf.py script works when the server is started with llama-cpp-python instead. However, the accuracy (acc,none) is only 0.68, which might indicate an error. Could you test it to confirm?

@gerayking
Author

Hi, I've run into the same problem. Have you solved it?

Try using llama-cpp-python instead of llama.cpp to start the server.
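For example, something along these lines should start an OpenAI-compatible server (the exact flags may differ between llama-cpp-python versions, so treat this as a rough pointer):

pip install "llama-cpp-python[server]"
python -m llama_cpp.server --model /path/to/model.gguf --port 8080

Then point base_url at http://localhost:8080.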

@L1-M1ng

L1-M1ng commented Apr 8, 2024

Hi, I've run into the same problem. Have you solved it?

Try using llama-cpp-python instead of llama.cpp to start the server.

Thanks for your reply, I'll try it later.

@L1-M1ng

L1-M1ng commented Apr 8, 2024

What model is being evaluated here? I will try to look into this soon.

Qwen-1.8B. I discovered that the original gguf.py script works when the server is started with llama-cpp-python instead. However, the accuracy (acc,none) is only 0.68, which might indicate an error. Could you test it to confirm?

I found that the way log_logits_sum is computed differs between gguf.py and huggingface.py.
In huggingface.py, the log-probs are gathered at the corresponding continuation token indices:
[screenshot of the huggingface.py loglikelihood code]
But in gguf.py, it just sums the probabilities of the model's output tokens and does not select the probabilities at the continuation token indices:
[screenshot of the gguf.py log-prob computation]
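In pseudocode, the huggingface.py-style indexing I mean is roughly this (a simplified sketch, not the actual implementation; the function name and arguments are just illustrative):

import torch.nn.functional as F

def continuation_logprob(logits, continuation_token_ids, context_len):
    # logits: [seq_len, vocab_size] tensor over the concatenated context + continuation
    logprobs = F.log_softmax(logits, dim=-1)
    total, is_greedy = 0.0, True
    for i, tok in enumerate(continuation_token_ids):
        pos = context_len - 1 + i  # logits at position pos predict the token at pos + 1
        total += logprobs[pos, tok].item()  # log-prob of the target continuation token
        if logprobs[pos].argmax().item() != tok:  # greedy check against the argmax token
            is_greedy = False
    return total, is_greedy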
Am I right to understand it this way? @haileyschoelkopf

@haileyschoelkopf
Contributor

In GGUF it should determine based on the offsets which tokens are part of the continuation and which are not (the while loop in the screenshot skips any context tokens).

I've definitely run this and gotten equivalent performance on a Llama-7B model compared against HF at the time... what is Qwen-1.8B's piqa accuracy in the huggingface implementation?
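Roughly, the offset-based logic I mean looks like this (a paraphrased sketch assuming OpenAI-style logprobs fields, not the exact code; see gguf.py for the real implementation):

def get_result(logprobs, context_length):
    # logprobs: OpenAI-style dict with "text_offset", "tokens",
    # "token_logprobs" and "top_logprobs" lists
    is_greedy = True
    offsets = logprobs["text_offset"]
    tokens = logprobs["tokens"]
    token_logprobs = logprobs["token_logprobs"]

    # skip tokens that belong to the context: advance until the character
    # offset reaches the end of the context string
    idx = 0
    while idx < len(offsets) and offsets[idx] < context_length:
        idx += 1

    # sum log-probs only over the continuation tokens
    continuation_logprobs = sum(token_logprobs[idx:])

    # greedy check: the observed token must be the top candidate at every position
    for i in range(idx, len(tokens)):
        top = logprobs["top_logprobs"][i]
        if max(top, key=top.get) != tokens[i]:
            is_greedy = False
            break

    return continuation_logprobs, is_greedy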

@gerayking
Author

In GGUF it should determine based on the offsets which tokens are part of the continuation and which are not (the while loop in the screenshot skips any context tokens).

I've definitely run this and gotten equivalent performance on a Llama-7B model compared against HF at the time... what is Qwen-1.8B's piqa accuracy in the huggingface implementation?

Thank you for your assistance. I've tested Qwen-1.8B's accuracy on PIQA and found it to be 0.68, but I'm not sure whether that is correct.
