Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea for Evals: Sorting numbers with repeats and negatives #782

Open
voynow opened this issue Apr 23, 2023 · 7 comments
Open

Idea for Evals: Sorting numbers with repeats and negatives #782

voynow opened this issue Apr 23, 2023 · 7 comments
Labels
Idea for Eval These issues keep track of requests for different kinds of eval PRs

Comments

@voynow
Copy link

voynow commented Apr 23, 2023

Note: I can develop this feature - creating the issue to get some feedback before developement

Is this diverse from the existing evals or is this too basic? I skimmed through the existing evals and I don't see anything similar except for complex number pattern (#223). I don't currently have GPT4 api access, although I do have chatGPT plus. Using the GPT4 engine I have tested this idea with the following examples:

Example 1

input:
Sort the following numbers least to greatest (only include the numbers in your response):
3, 5, 2, 3, 10, 3, 5, 7, 7, 9, 10, 8, 7, 4, 5, 5, 6, 5, 1, 8, 1, 7, 4, 10, 4, 1, 5, 7, 3, 2

ideal:
1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

response:
1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

In this example, GPT4 miscounted the 5s and the 7s

Example 1 retry in a new window

input:
Sort the following numbers least to greatest (only include the numbers in your response):
3, 5, 2, 3, 10, 3, 5, 7, 7, 9, 10, 8, 7, 4, 5, 5, 6, 5, 1, 8, 1, 7, 4, 10, 4, 1, 5, 7, 3, 2

ideal:
1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

response:
1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

In this example, GPT4 miscounted the 5s again

Example 2

input:
Sort the following numbers least to greatest (only include the numbers in your response):
2, -4, -8, 9, -1, 10, 8, -7, 7, -1, -4, -5, -1, 0, 1, 8, 2, 0, -8, -10, 8, -5, -10, 7, -1, -3, -1, 8, 7, -5, -2, 1, -4, 7, 9, 6, -8, 10, -5, 5, -6, 4, -5, -2, -8, -1, -10, 1, -8, -4

ideal:
-10, -10, -10, -8, -8, -8, -8, -8, -7, -6, -5, -5, -5, -5, -5, -4, -4, -4, -4, -3, -2, -2, -1, -1, -1, -1, -1, -1, 0, 0, 1, 1, 1, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 10

response:
-10, -10, -10, -8, -8, -8, -8, -8, -7, -6, -5, -5, -5, -5, -4, -4, -4, -4, -3, -2, -2, -1, -1, -1, -1, -1, 0, 0, 1, 1, 1, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 10

Let me know what you all think. This would be my first contribution to open source - very exciting!

@voynow voynow closed this as not planned Won't fix, can't repro, duplicate, stale Apr 23, 2023
@qrdlgit
Copy link
Contributor

qrdlgit commented Apr 23, 2023

@voynow Why did you close this? It looks good to me, but you should get Andrew's opinion.

@voynow
Copy link
Author

voynow commented Apr 23, 2023

@qrdlgit I found this (#93) PR that looks like it does what I was planning on doing. This has been opened for a while with no approval, maybe something is wrong with this one?

Am I correct in my understanding here? Didn't want to duplicate work/logic.

@qrdlgit
Copy link
Contributor

qrdlgit commented Apr 23, 2023

Yeah, I saw that as well. TBH though, this seems like a great eval to me, but I'm just a user.

Sorting things like this is a very common use case that anyone might use GPT4 for.

For example, let's say you are a teacher and have a set of names or ids of students and you want to sort them in some way as a way of creating a 'fair order'. This is very common, as we all know.

I also very frequently use it for one off decoding/encoding tasks. It's a bit unnerving to see it fail so silently like this.

It'd be great to get an @andrew-openai perspective. It could be just a hard thing for them to fix at this point, which might be why they don't want an eval (yet), but I think it would help to hear that.

@voynow
Copy link
Author

voynow commented Apr 23, 2023

Great perspective thanks for adding that. I want to point out that I also just created #785 - so I can work on either one of these once we get some more perspectives here.

@Ein-Tim
Copy link
Contributor

Ein-Tim commented Apr 24, 2023

@voynow If you want more feedback on this issue, I suggest reopening it.

@voynow voynow reopened this Apr 24, 2023
@andrew-openai andrew-openai added the Idea for Eval These issues keep track of requests for different kinds of eval PRs label Apr 26, 2023
@andrew-openai
Copy link
Contributor

andrew-openai commented Apr 26, 2023

Hey, thanks for the discussion!

I agree this is a good eval idea, and I agree with qrdlgit that it seems to be quite representative of common tasks. Also, thanks for bringing my attention to #93, it looks like a good eval and I'll probably merge it after testing it myself.

I like the examples you've given: sorting lists of students or encoding/decoding tasks. If you are interested in contributing evals of this flavor, having these domain specific variants are quite useful and I wouldn't be surprised if model performances vary across the "domain" that this basic capability is applied to. We've reduced the minimum count to 15 samples per eval, so this should be pretty quick to write by hand or collect variants of what you may already be using with the API or ChatGPT.

@voynow
Copy link
Author

voynow commented May 4, 2023

@andrew-openai Thanks for your feedback above. FYI I created two PRs based on your suggestions. See below:

Sorting rectangles by area: #878
Counting numbers greater than X: #856

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Idea for Eval These issues keep track of requests for different kinds of eval PRs
Projects
None yet
Development

No branches or pull requests

4 participants