Default behavior bootstrapping #1121

Closed
tom-doerr opened this issue Jun 7, 2024 · 3 comments

Comments

@tom-doerr
Contributor

I think bootstrapping should have a special case for situations where all scores and scalars are non-zero.
It seems to me that bootstrapping fails completely and decreases performance when the metric used is never, or almost never, zero.

@arnavsinghvi11
Collaborator

Hey @tom-doerr, I think I understand the issue here, but I'd love to hear more about what you mean, ideally with an example.

Is the idea that there should be more dynamic feedback during the bootstrapping to ensure the demonstration selection will lead to equal or better performance compared to the uncompiled program, and avoid such cases where performance decreases? We could definitely explore some improvements to the existing BootstrapFewShot optimizer.

@tom-doerr
Contributor Author

I'm not sure what a good solution would be, but the current behavior isn't optimal, in my opinion.

Example:
My objective is to generate great tweets.

tweet, score
Hdhuhhdhdh, 0.01
U88hdju, 0.1
Jhdjdjdjjd if d, 0.05
Hdhjjd, 0.02
Good morning to all my followers! 🌞, 0.8

Bootstrap will use the first 4 nonsense samples in its prompt, because their scores are all non-zero, making performance much worse. This also happened to me on real data.
Sure, you can do random search, but in this case it would also deliver worse-than-uncompiled performance.
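
A minimal sketch of the failure mode described above, assuming the optimizer accepts any demonstration whose metric value is truthy and stops once it has collected 4 demos. The function `bootstrap_by_truthiness` and the parameter `max_demos` are hypothetical names used for illustration; this is not the DSPy implementation.

```python
# (tweet, score) pairs copied from the example above.
scored_tweets = [
    ("Hdhuhhdhdh", 0.01),
    ("U88hdju", 0.1),
    ("Jhdjdjdjjd if d", 0.05),
    ("Hdhjjd", 0.02),
    ("Good morning to all my followers! 🌞", 0.8),
]


def bootstrap_by_truthiness(examples, max_demos=4):
    """Hypothetical selector: keep the first `max_demos` examples
    whose score is truthy, i.e. any non-zero number."""
    demos = []
    for tweet, score in examples:
        if score:  # 0.01 is truthy, so even a near-zero score "passes"
            demos.append(tweet)
        if len(demos) >= max_demos:
            break
    return demos


selected = bootstrap_by_truthiness(scored_tweets)
print(selected)
```

Under these assumptions the four nonsense tweets fill every demo slot before the only good tweet is reached.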

@arnavsinghvi11
Collaborator

I see! Ironically, I was just updating the documentation for Bootstrap from your other issue #1118, and the metric_threshold arg would be very useful in this case.
You could set a threshold of 0.75, for example, so that examples with a low but non-zero score are not selected; the bootstrapping would then only consider examples that meet the threshold as "passing examples". Let me know if that makes sense!
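
A rough sketch of how a threshold changes the selection, using the tweet scores from the earlier comment. The function `bootstrap_with_threshold` is a hypothetical stand-in for the optimizer's pass/fail check, not DSPy code; only the name `metric_threshold` and the value 0.75 come from this thread.

```python
# (tweet, score) pairs copied from the example earlier in the thread.
scored_tweets = [
    ("Hdhuhhdhdh", 0.01),
    ("U88hdju", 0.1),
    ("Jhdjdjdjjd if d", 0.05),
    ("Hdhjjd", 0.02),
    ("Good morning to all my followers! 🌞", 0.8),
]


def bootstrap_with_threshold(examples, metric_threshold, max_demos=4):
    """Hypothetical selector: keep up to `max_demos` examples whose
    score is at least `metric_threshold`."""
    demos = []
    for tweet, score in examples:
        if score >= metric_threshold:  # low non-zero scores no longer pass
            demos.append(tweet)
        if len(demos) >= max_demos:
            break
    return demos


passing = bootstrap_with_threshold(scored_tweets, metric_threshold=0.75)
print(passing)
```

With DSPy itself, the equivalent would presumably be passing `metric_threshold=0.75` when constructing the `BootstrapFewShot` optimizer; check the optimizer documentation for the exact signature.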

@okhat closed this as completed on Jun 22, 2024