
Are unmerged PRs the result of irrelevance to the model? #873

Open

albukirky1 opened this issue Apr 30, 2023 · 12 comments

Comments

albukirky1 commented Apr 30, 2023

Describe the feature or improvement you're requesting

Hi, this is not a suggestion, but rather a question.

I have been working on new eval ideas lately, but none of my PRs seem to be getting reviewed.

I was wondering: is this because the PRs (eval ideas) are not important enough, or not a big enough contribution, for the model?

I'm not sure whether my way of thinking about eval ideas is right. Perhaps my PRs are headed in the wrong direction, and my approach (and perhaps that of others as well) should be adjusted in order to contribute better evals.

Example of a PR I lately sent:
#841

Additional context

No response

qrdlgit (Contributor) commented Apr 30, 2023

I downloaded all the merged PRs and asked GPT-4 to summarize the common characteristics:

The merged evals cover a wide range of topics and skills, including:

  • Language understanding: Japanese, Russian, Dutch, Brazilian, Swedish, Greek, Bulgarian, Belarusian, Mongolian, Ukrainian, and Hebrew.
  • Medical knowledge: Japanese national medical exam, heart disease prediction, Russian medical, and MedMCQA.
  • Science and mathematics: pH calculation, general science reasoning, Mendelian inheritance, balancing chemical equations, and algebra word problems.
  • Spatial and logical reasoning: SVG understanding, three-point gene mapping, knot theory, physical rotation reasoning, LogiQA, and diagrammatical reasoning logic.
  • Legal knowledge: Illinois law claims, US tort law, and legal ethics.
  • Finance and economics: utility charge eval, financial math, and taxes eval.
  • Computer science and programming: bitwise eval, Forth Stack Simulator, and computer science theory.
  • Emotional intelligence: emotional intelligence evaluation.
  • Music theory: tempo and time signature.
  • Driving and navigation: Japanese driving license and lat-long-identify eval.
  • Chess: counting pieces left on the board and playing chess.
  • Miscellaneous: rhymes, emoji riddles, ROT13 strings, anagrams, counting bigrams, poker hand ranks, positive-binary-operations, chess, and sarcasm detection.

These evals assess various capabilities of the AI model, including language understanding, subject matter knowledge, problem-solving skills, spatial understanding, and emotional intelligence.
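
For reference, a minimal sketch of that extraction step, assuming the `requests` library and the public GitHub REST API (the page cap is an arbitrary illustrative default, and unauthenticated requests are heavily rate-limited):

```python
import requests

REPO = "openai/evals"  # the repository discussed in this thread

def merged_pr_titles(max_pages: int = 10) -> list[str]:
    """Collect titles of merged PRs by paging through /pulls?state=closed."""
    titles = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            f"https://api.github.com/repos/{REPO}/pulls",
            params={"state": "closed", "per_page": 100, "page": page},
            headers={"Accept": "application/vnd.github+json"},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        # merged_at is null (None) for PRs that were closed without merging
        titles.extend(pr["title"] for pr in batch if pr["merged_at"])
    return titles

if __name__ == "__main__":
    print("\n".join(merged_pr_titles()))
```

The resulting titles can then be pasted into GPT-4 (or sent via the API) with a prompt asking for common characteristics, matching the approach described above.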

albukirky1 (Author) commented Apr 30, 2023

@qrdlgit I'm not sure the sample of merged PRs is large enough to conclude anything about what they look for, but this is a really nice observation.
It does make sense that most PRs revolve around languages; maybe the goal is for the model to better understand how to digest large texts, rather than to give better answers.

SkyaTura commented May 1, 2023

@qrdlgit just out of curiosity, did you try to identify the patterns in the ignored PRs?

Edit: Actually, it would be great to analyze every PR along with its status: open-active, open-stale, draft-active, draft-stale, closed-merged, closed-canceled, and so on.
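
For illustration, one way to bucket each object returned by GET /repos/{owner}/{repo}/pulls?state=all into those statuses; the 30-day "stale" threshold here is an assumption, not anything official:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)  # assumed staleness threshold

def classify(pr: dict, now: datetime) -> str:
    """Map one PR object from the GitHub REST API to a status bucket."""
    if pr["state"] == "closed":
        # merged_at is None when a PR was closed without being merged
        return "closed-merged" if pr["merged_at"] else "closed-canceled"
    updated = datetime.fromisoformat(pr["updated_at"].replace("Z", "+00:00"))
    prefix = "draft" if pr.get("draft") else "open"
    suffix = "stale" if now - updated > STALE_AFTER else "active"
    return f"{prefix}-{suffix}"

# e.g. classify(pr, datetime.now(timezone.utc)) for each pr in the listing
```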

qrdlgit (Contributor) commented May 1, 2023

@SkyaTura Yes, absolutely. For those serious about creating an eval here, there is definitely value in going back through all the PRs and reading them closely.

That said, it's possible there are extrinsic factors not mentioned in the documentation. It's sometimes difficult to predict what those might be.

SkyaTura commented May 1, 2023

I was wondering what we could extract by running the whole PR history through an LLM itself 🤔

That would be expensive, though.

I'm still figuring out how this works; I only found this repo a couple of minutes ago.

qrdlgit (Contributor) commented May 1, 2023

Not that expensive, though perhaps a bit technically challenging. However, we can always ask GPT-4, right?

Try this prompt:

I'd like to better understand why PRs are being merged and not merged. Is there a way I can extract all the PR data for a particular repository on GitHub and feed it to GPT-4 to summarize and analyze?

Depending on your particular skill set, you might need to get GPT-4 to do a further breakdown of what it provides. Also, you will need to explain that you will be using the web interface for GPT-4. I'd recommend using the GitHub REST APIs if possible.
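
For anyone who does have API access rather than just the web interface, the programmatic version is short. A sketch using the 2023-era `openai` Python package, where `pr_lines` is a hypothetical list of "number | title | status" strings gathered as described above:

```python
import openai

openai.api_key = "sk-..."  # your own API key

def summarize_prs(pr_lines: list[str]) -> str:
    """Ask GPT-4 to contrast merged vs. unmerged PRs from a flat text listing."""
    prompt = (
        "Here is a list of pull requests with their statuses. "
        "Summarize the common characteristics of merged vs. unmerged PRs:\n\n"
        + "\n".join(pr_lines)
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]
```

For a long PR history, the listing would need to be chunked to fit the context window, summarizing each chunk and then summarizing the summaries.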

SkyaTura commented May 1, 2023

Indeed, I already fetched the PR history to try something, but going only by titles there is not much beyond what you mentioned before.

Sanitizing the descriptions and prompting with them as well might give better results, but that would have to be done programmatically, and I would need a GPT-4 API key for that.

(I also don't have ChatGPT Plus yet; it's too expensive in my currency.)

Maybe I'll try a proof of concept with 3.5 and a handpicked selection later.

Sorry for deviating from the original question of the issue, btw.

qrdlgit (Contributor) commented May 1, 2023

@SkyaTura I think your deviation was important, and there needs to be more discussion around this topic, but you're right. I'll take the blame for the hijack here, so I have opened a discussion on this topic and will continue it there: #882

eugene-kim-pipe17 commented

@andrew-openai is there anything you can share here?

I've submitted a couple of eval PRs as well (#763 and #747). It would be great to know whether the lack of response is simply due to a large backlog of PRs to assess (I'm sure you and your team are very busy) or whether it's because of issues with the PR content or quality.

qrdlgit (Contributor) commented May 2, 2023

One suggestion for the folks at OpenAI: you might want to add an item like this to the PR checklist:

[ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee that the PR will be reviewed or merged, or that GPT-4 access will be granted.

Please note: this is not meant as a complaint. I think we all understand that OpenAI is resource-constrained and is trying to strike the right balance in how it provides access. However, I think it would be fair to let people know up front about the situation, as they may expect to get feedback on their PRs.

I am working on a GPT prompt that could provide some reviewing/critiquing capability: #882. I'm coming to terms with the fact that most of our PRs probably won't get merged into this repo, but I am concerned that there is a missed opportunity here. These PRs could be useful for other AI projects, so some review and feedback would help ensure that the evals are better formed and generally useful.

If @andrew-openai or others could take a look at the prompt and provide some thoughts on how to improve it so we'd get some review capability, that would be very helpful.

andrew-openai (Contributor) commented

Hi folks, sorry about the pace of PR reviews. I took some time off this week, which is why there haven't been many reviews in the past few days.

I was wondering: is this because the PRs (eval ideas) are not important enough, or not a big enough contribution, for the model?

The general pattern has been that most eval PRs have good content but need iteration on the prompts to become meaningful evals. Given this, and recognizing that it takes quite some time and effort to open an eval PR, I'm trying to make sure that each PR gets some feedback on how to improve it rather than an outright rejection. So while I have looked at many evals, I haven't had the chance to leave that feedback on each one. We're well aware that this is slowing down the pace at which PRs get reviewed.

In the next few weeks, more people from our side will be available to review eval PRs and leave that feedback, beyond just me. This should dramatically improve the pace at which you get feedback on your ideas and have PRs closed out.

Thanks for your patience. We love the enthusiasm, and the contributions so far have been great. Until we get more help, I'll also resume reviewing evals over the next few days.

eugene-kim-pipe17 commented

I appreciate the response and the transparency @andrew-openai !
