Idea for Evals: Sorting numbers with repeats and negatives #782

voynow · 2023-04-23T21:01:06Z

Note: I can develop this feature - creating the issue to get some feedback before developement

Is this diverse from the existing evals or is this too basic? I skimmed through the existing evals and I don't see anything similar except for complex number pattern (#223). I don't currently have GPT4 api access, although I do have chatGPT plus. Using the GPT4 engine I have tested this idea with the following examples:

Example 1

input:
Sort the following numbers least to greatest (only include the numbers in your response):
3, 5, 2, 3, 10, 3, 5, 7, 7, 9, 10, 8, 7, 4, 5, 5, 6, 5, 1, 8, 1, 7, 4, 10, 4, 1, 5, 7, 3, 2

ideal:
1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

response:
1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

In this example, GPT4 miscounted the 5s and the 7s

Example 1 retry in a new window

input:
Sort the following numbers least to greatest (only include the numbers in your response):
3, 5, 2, 3, 10, 3, 5, 7, 7, 9, 10, 8, 7, 4, 5, 5, 6, 5, 1, 8, 1, 7, 4, 10, 4, 1, 5, 7, 3, 2

ideal:
1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

response:
1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

In this example, GPT4 miscounted the 5s again

Example 2

input:
Sort the following numbers least to greatest (only include the numbers in your response):
2, -4, -8, 9, -1, 10, 8, -7, 7, -1, -4, -5, -1, 0, 1, 8, 2, 0, -8, -10, 8, -5, -10, 7, -1, -3, -1, 8, 7, -5, -2, 1, -4, 7, 9, 6, -8, 10, -5, 5, -6, 4, -5, -2, -8, -1, -10, 1, -8, -4

ideal:
-10, -10, -10, -8, -8, -8, -8, -8, -7, -6, -5, -5, -5, -5, -5, -4, -4, -4, -4, -3, -2, -2, -1, -1, -1, -1, -1, -1, 0, 0, 1, 1, 1, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 10

response:
-10, -10, -10, -8, -8, -8, -8, -8, -7, -6, -5, -5, -5, -5, -4, -4, -4, -4, -3, -2, -2, -1, -1, -1, -1, -1, 0, 0, 1, 1, 1, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 10

Let me know what you all think. This would be my first contribution to open source - very exciting!

qrdlgit · 2023-04-23T22:33:50Z

@voynow Why did you close this? It looks good to me, but you should get Andrew's opinion.

voynow · 2023-04-23T22:46:37Z

@qrdlgit I found this (#93) PR that looks like it does what I was planning on doing. This has been opened for a while with no approval, maybe something is wrong with this one?

Am I correct in my understanding here? Didn't want to duplicate work/logic.

qrdlgit · 2023-04-23T23:48:31Z

Yeah, I saw that as well. TBH though, this seems like a great eval to me, but I'm just a user.

Sorting things like this is a very common use case that anyone might use GPT4 for.

For example, let's say you are a teacher and have a set of names or ids of students and you want to sort them in some way as a way of creating a 'fair order'. This is very common, as we all know.

I also very frequently use it for one off decoding/encoding tasks. It's a bit unnerving to see it fail so silently like this.

It'd be great to get an @andrew-openai perspective. It could be just a hard thing for them to fix at this point, which might be why they don't want an eval (yet), but I think it would help to hear that.

voynow · 2023-04-23T23:52:31Z

Great perspective thanks for adding that. I want to point out that I also just created #785 - so I can work on either one of these once we get some more perspectives here.

Ein-Tim · 2023-04-24T05:36:08Z

@voynow If you want more feedback on this issue, I suggest reopening it.

andrew-openai · 2023-04-26T22:53:43Z

Hey, thanks for the discussion!

I agree this is a good eval idea, and I agree with qrdlgit that it seems to be quite representative of common tasks. Also, thanks for bringing my attention to #93, it looks like a good eval and I'll probably merge it after testing it myself.

I like the examples you've given: sorting lists of students or encoding/decoding tasks. If you are interested in contributing evals of this flavor, having these domain specific variants are quite useful and I wouldn't be surprised if model performances vary across the "domain" that this basic capability is applied to. We've reduced the minimum count to 15 samples per eval, so this should be pretty quick to write by hand or collect variants of what you may already be using with the API or ChatGPT.

voynow · 2023-05-04T11:49:37Z

@andrew-openai Thanks for your feedback above. FYI I created two PRs based on your suggestions. See below:

Sorting rectangles by area: #878
Counting numbers greater than X: #856

voynow closed this as not planned Won't fix, can't repro, duplicate, stale Apr 23, 2023

voynow reopened this Apr 24, 2023

andrew-openai added the Idea for Eval These issues keep track of requests for different kinds of eval PRs label Apr 26, 2023

andrew-openai closed this as completed Apr 28, 2023

andrew-openai reopened this Apr 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Idea for Evals: Sorting numbers with repeats and negatives #782

Idea for Evals: Sorting numbers with repeats and negatives #782

voynow commented Apr 23, 2023

qrdlgit commented Apr 23, 2023

voynow commented Apr 23, 2023

qrdlgit commented Apr 23, 2023

voynow commented Apr 23, 2023

Ein-Tim commented Apr 24, 2023

andrew-openai commented Apr 26, 2023 •

edited

Loading

voynow commented May 4, 2023 •

edited

Loading

Idea for Evals: Sorting numbers with repeats and negatives #782

Idea for Evals: Sorting numbers with repeats and negatives #782

Comments

voynow commented Apr 23, 2023

Example 1

Example 1 retry in a new window

Example 2

qrdlgit commented Apr 23, 2023

voynow commented Apr 23, 2023

qrdlgit commented Apr 23, 2023

voynow commented Apr 23, 2023

Ein-Tim commented Apr 24, 2023

andrew-openai commented Apr 26, 2023 • edited Loading

voynow commented May 4, 2023 • edited Loading

andrew-openai commented Apr 26, 2023 •

edited

Loading

voynow commented May 4, 2023 •

edited

Loading