
RateLimitError Causes Duplicate Logs and Incorrect Metrics #955

Closed
keltin13 opened this issue May 10, 2023 · 2 comments
Labels
bug Something isn't working

Describe the bug

If a RateLimitError occurs while running an eval, the evaluation code may run a sample more than once, causing the final reported accuracy to be incorrect.

In my case, a single log file contained two sampling/match pairs for the exact same sample_id, with different event_ids and timestamps indicating the sample was run twice:

{"run_id": "230510185932RNI3M7IV", "event_id": 184, "sample_id": "note-intervals.dev.316", "type": "sampling", "data": {"prompt": [{"role": "system", "content": "You are a note interval calculator. Given two notes, calculate the interval between them. Respond only with the abbreviated interval name (e.g. P4, d5)."}, {"role": "user", "content": "D# F"}], "sampled": ["m3"]}, "created_by": "", "created_at": "2023-05-10 19:02:26.613346+00:00"}
{"run_id": "230510185932RNI3M7IV", "event_id": 185, "sample_id": "note-intervals.dev.316", "type": "match", "data": {"correct": false, "expected": "d3", "picked": null, "sampled": "m3", "options": ["d3"]}, "created_by": "", "created_at": "2023-05-10 19:02:26.613346+00:00"}
{"run_id": "230510185932RNI3M7IV", "event_id": 194, "sample_id": "note-intervals.dev.316", "type": "sampling", "data": {"prompt": [{"role": "system", "content": "You are a note interval calculator. Given two notes, calculate the interval between them. Respond only with the abbreviated interval name (e.g. P4, d5)."}, {"role": "user", "content": "D# F"}], "sampled": ["m3"]}, "created_by": "", "created_at": "2023-05-10 19:02:46.433398+00:00"}
{"run_id": "230510185932RNI3M7IV", "event_id": 195, "sample_id": "note-intervals.dev.316", "type": "match", "data": {"correct": false, "expected": "d3", "picked": null, "sampled": "m3", "options": ["d3"]}, "created_by": "", "created_at": "2023-05-10 19:02:46.433398+00:00"}

This caused the final accuracy to be misreported in both the terminal output and log file as 93/433=0.21478 instead of 93/432=0.21527 (there are only 432 samples in the samples.jsonl file):

[2023-05-10 15:10:24,894] [oaieval.py:149] accuracy: 0.21478060046189376

This is a minor problem in practice, but it is a clear correctness issue. As far as I can tell, hitting a RateLimitError should only cause the request to back off temporarily and retry later; it should not count as an eval failure or re-record the sample. I am not familiar enough with the code to suggest a fix, but the sketch below shows the behavior I would expect.
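
For illustration only, here is a hypothetical sketch of that expected behavior; it is not the actual evals retry logic, and sample_fn is a placeholder for whatever issues the completion request:

```python
# Hypothetical sketch: retry the same request with exponential backoff on
# RateLimitError, so each sample is sampled and recorded exactly once.
import time

import openai


def sample_with_backoff(sample_fn, max_retries=5, base_delay=2.0):
    """sample_fn is a placeholder for the function that calls the API."""
    for attempt in range(max_retries):
        try:
            return sample_fn()
        except openai.error.RateLimitError:
            # Back off, then retry the same request in place rather than
            # re-enqueueing the sample as a new event.
            time.sleep(base_delay * 2**attempt)
    raise RuntimeError("rate limited on every retry")
```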

To Reproduce

Reproducing this depends on your rate limits: run a sufficiently large eval that you observe an openai.error.RateLimitError, then read the log file and check for duplicate sample_ids (a sketch of such a check is below).
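
Something like the following can perform that check — a minimal sketch assuming the JSONL event format shown above; the log path is a placeholder:

```python
# Sketch: scan an evals JSONL log for duplicate sample_ids among "match"
# events, and compare accuracy as reported vs. deduplicated.
import json
from collections import Counter

LOG_PATH = "evallogs/230510185932RNI3M7IV.jsonl"  # placeholder path

matches = []
with open(LOG_PATH) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "match":
            matches.append((event["sample_id"], event["data"]["correct"]))

counts = Counter(sid for sid, _ in matches)
dupes = {sid: n for sid, n in counts.items() if n > 1}
print(f"{len(matches)} match events, {len(counts)} unique sample_ids")
print("duplicated sample_ids:", dupes or "none")

# Accuracy over all match events (what gets reported) vs. one event per
# sample_id (keeping the last occurrence).
deduped = dict(matches)
print("reported accuracy:     ", sum(c for _, c in matches) / len(matches))
print("deduplicated accuracy: ", sum(deduped.values()) / len(deduped))
```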

Code snippets

No response

OS

Windows 10

Python version

v3.9.16

Library version

Cloned from source

keltin13 added the bug label May 10, 2023
jwang47 (Contributor) commented May 18, 2023

#987 should fix the issue, closing for now. Let me know if you see it again with the latest changes.
