
RateLimitError Causes Duplicate Logs and Incorrect Metrics #955

Closed
keltin13 opened this issue May 10, 2023 · 2 comments
Labels
bug Something isn't working

Describe the bug

If a RateLimitError occurs while running an eval, the evaluation code may run a sample more than once, causing the final reported accuracy to be incorrect.

In my case, a single log file contained two sampling/match pairs for the exact same sample_id, with different event_ids and timestamps indicating the sample was run twice:

{"run_id": "230510185932RNI3M7IV", "event_id": 184, "sample_id": "note-intervals.dev.316", "type": "sampling", "data": {"prompt": [{"role": "system", "content": "You are a note interval calculator. Given two notes, calculate the interval between them. Respond only with the abbreviated interval name (e.g. P4, d5)."}, {"role": "user", "content": "D# F"}], "sampled": ["m3"]}, "created_by": "", "created_at": "2023-05-10 19:02:26.613346+00:00"}
{"run_id": "230510185932RNI3M7IV", "event_id": 185, "sample_id": "note-intervals.dev.316", "type": "match", "data": {"correct": false, "expected": "d3", "picked": null, "sampled": "m3", "options": ["d3"]}, "created_by": "", "created_at": "2023-05-10 19:02:26.613346+00:00"}
{"run_id": "230510185932RNI3M7IV", "event_id": 194, "sample_id": "note-intervals.dev.316", "type": "sampling", "data": {"prompt": [{"role": "system", "content": "You are a note interval calculator. Given two notes, calculate the interval between them. Respond only with the abbreviated interval name (e.g. P4, d5)."}, {"role": "user", "content": "D# F"}], "sampled": ["m3"]}, "created_by": "", "created_at": "2023-05-10 19:02:46.433398+00:00"}
{"run_id": "230510185932RNI3M7IV", "event_id": 195, "sample_id": "note-intervals.dev.316", "type": "match", "data": {"correct": false, "expected": "d3", "picked": null, "sampled": "m3", "options": ["d3"]}, "created_by": "", "created_at": "2023-05-10 19:02:46.433398+00:00"}

This caused the final accuracy to be misreported in both the terminal output and log file as 93/433=0.21478 instead of 93/432=0.21527 (there are only 432 samples in the samples.jsonl file):

[2023-05-10 15:10:24,894] [oaieval.py:149] accuracy: 0.21478060046189376

This is a minor problem in practice, but it is a clear correctness issue. As far as I can tell, hitting a RateLimitError should only cause the request to back off temporarily and retry later; it should not count as an eval failure or re-record the sample. I am not familiar enough with the code to suggest a fix, but the sketch below shows the behavior I would expect.
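
For illustration only, here is a hypothetical sketch of that expected behavior; it is not the actual evals retry logic, and sample_fn is a placeholder for whatever issues the completion request:

```python
# Hypothetical sketch: retry the same request with exponential backoff on
# RateLimitError, so each sample is sampled and recorded exactly once.
import time

import openai


def sample_with_backoff(sample_fn, max_retries=5, base_delay=2.0):
    """sample_fn is a placeholder for the function that calls the API."""
    for attempt in range(max_retries):
        try:
            return sample_fn()
        except openai.error.RateLimitError:
            # Back off, then retry the same request in place rather than
            # re-enqueueing the sample as a new event.
            time.sleep(base_delay * 2**attempt)
    raise RuntimeError("rate limited on every retry")
```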

To Reproduce

Reproducing this depends on your rate limits: run a sufficiently large eval that you observe an openai.error.RateLimitError, then read the log file and check for duplicate sample_ids (a sketch of such a check is below).
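
Something like the following can perform that check — a minimal sketch assuming the JSONL event format shown above; the log path is a placeholder:

```python
# Sketch: scan an evals JSONL log for duplicate sample_ids among "match"
# events, and compare accuracy as reported vs. deduplicated.
import json
from collections import Counter

LOG_PATH = "evallogs/230510185932RNI3M7IV.jsonl"  # placeholder path

matches = []
with open(LOG_PATH) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "match":
            matches.append((event["sample_id"], event["data"]["correct"]))

counts = Counter(sid for sid, _ in matches)
dupes = {sid: n for sid, n in counts.items() if n > 1}
print(f"{len(matches)} match events, {len(counts)} unique sample_ids")
print("duplicated sample_ids:", dupes or "none")

# Accuracy over all match events (what gets reported) vs. one event per
# sample_id (keeping the last occurrence).
deduped = dict(matches)
print("reported accuracy:     ", sum(c for _, c in matches) / len(matches))
print("deduplicated accuracy: ", sum(deduped.values()) / len(deduped))
```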

Code snippets

No response

OS

Windows 10

Python version

v3.9.16

Library version

Cloned from source

keltin13 added the bug label May 10, 2023
jwang47 (Contributor) commented May 18, 2023

#987 should fix the issue, closing for now. Let me know if you see it again with the latest changes.
