Add WIP support for returning top tokens #617

Closed
wants to merge 12 commits

Conversation

Vinno97 (Contributor) commented Jul 14, 2023

What does this PR do?

This PR adds support for returning the log probabilities of the top N tokens (besides the chosen token).

The token selection code is borrowed from IBM's fork (many thanks @njhill) and I merged it into the current main branch. I chose to name the input parameter top_n_tokens, and it can be any positive integer. If it is set, a top_tokens field in the result details will contain an L×N list of tokens, where L is the sequence length and N is the number of requested tokens.
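For illustration, this is roughly what the feature looks like from the client side. It is a hedged sketch: the top_n_tokens parameter and the top_tokens field in the details are from this PR, while the endpoint path, the `parameters`/`details` request fields, and the per-token `text`/`logprob` keys are assumptions about the server's existing schema.

```python
import requests

# Hypothetical request against a running text-generation-inference server.
# `top_n_tokens` is the parameter added by this PR; the other fields are
# assumed to follow the server's existing /generate schema.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "The capital of France is",
        "parameters": {
            "max_new_tokens": 3,
            "details": True,    # ask for per-token details in the response
            "top_n_tokens": 5,  # return the 5 most likely tokens per step
        },
    },
    timeout=60,
).json()

# `details.top_tokens` is an L x N list: one entry per generated position (L),
# each holding the N requested candidate tokens. The `text`/`logprob` keys are
# assumptions about the per-token schema.
for step in response["details"]["top_tokens"]:
    print([(candidate["text"], candidate["logprob"]) for candidate in step])
```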

Note that it is currently very much WIP.

  • It is currently only implemented for seq-to-seq models (though it's modular and should be easy to implement for other models as well).
  • I haven't implemented this functionality for the prefill. Do we want to?
  • I sort the tokens based on their logprobs, because IBM's implementation had this functionality. However, it may cause unnecessary overhead.
  • IBM's code doesn't just return the "top n" tokens: it explicitly returns more tokens when there are candidates with equal probabilities (see the sketch after this list). I am not sure if this is desired.
  • I have not yet implemented input validation of the top_n_tokens field.
  • I have also done no benchmarking to test the impact on performance.
    • I haven't given the benchmarking tool the capability to benchmark this feature.
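To make the tie behaviour mentioned above concrete, here is a minimal sketch (not the PR's or IBM's actual code) of "top n, extended on ties": every token whose logprob equals the n-th best one is kept as well.

```python
import torch

def top_n_with_ties(logprobs: torch.Tensor, n: int):
    """logprobs: 1-D tensor of vocabulary log-probabilities for one step."""
    sorted_lp, sorted_ids = torch.sort(logprobs, descending=True)
    cutoff = sorted_lp[n - 1]
    # Keep every token whose logprob is at least the n-th best, so exact ties
    # with the n-th entry are included as well.
    keep = int((sorted_lp >= cutoff).sum())
    return sorted_ids[:keep].tolist(), sorted_lp[:keep].tolist()

# Example: with n=2 and a tie at the second position, three tokens come back.
probs = torch.tensor([0.5, 0.2, 0.2, 0.1])
print(top_n_with_ties(torch.log(probs), n=2))
```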

Fixes #604

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed.
Specifically @OlivierDehaene and @njhill

spew commented Jul 19, 2023

Could this be used to "shorten" a generation (i.e. only use the first N tokens)? If so, how would one use the logprobs to calculate a threshold for throwing out all or a portion of a generation?

Vinno97 (Contributor Author) commented Jul 24, 2023

This PR only adds the ability to return not just the sampled token, but also the most likely alternative tokens. This is mostly useful when building applications that require the logprobs of multiple tokens, such as ranking and classification.

What you are referring to sounds like a custom stopping criterion.
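For what spew describes, the logprobs returned in the details could be post-processed on the client side. A hypothetical sketch, assuming per-token entries with `text` and `logprob` keys; the helper name and threshold are made up for illustration:

```python
import math

def truncate_on_confidence(tokens: list[dict], min_prob: float = 0.05) -> str:
    """tokens: per-step dicts with at least `text` and `logprob` keys."""
    kept = []
    for tok in tokens:
        if math.exp(tok["logprob"]) < min_prob:
            break  # stop at the first low-confidence token
        kept.append(tok["text"])
    return "".join(kept)

# Toy example with made-up logprobs: the final token falls below the threshold.
print(truncate_on_confidence(
    [
        {"text": "Par", "logprob": -0.1},
        {"text": "is", "logprob": -0.2},
        {"text": "!", "logprob": -4.5},
    ],
    min_prob=0.05,
))
```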

OlivierDehaene (Member) commented Jul 25, 2023

This is very cool. Thanks for the work!
Are you ok with me maybe merging this to a dev branch to add my modifications directly before merging it to the main branch?
Cheers!

Vinno97 (Contributor Author) commented Jul 26, 2023

It currently only works (properly) for FlashCausalLM models. Perhaps it'd be good to wait until I implement it everywhere.

Vinno97 (Contributor Author) commented Jul 28, 2023

@OlivierDehaene I think it's mergeable now!

I haven't edited the Python client yet, but all server functionality should work now. For some reason, however, my latest commit (95d0fba) doesn't show up in this PR yet.

OlivierDehaene (Member) commented

@Vinno97, I will be off in August but @Narsil will be available to review the PR.
For your info, we updated the license of this repo to HFOIL. You can review the reasons behind this license change here.

Vinno97 (Contributor Author) commented Jul 29, 2023

Thank you for notifying me of the licensing change. Luckily, as far as I understand it, this has no effect on our operations for now.

Vinno97 marked this pull request as ready for review on August 2, 2023, 13:08.

Vinno97 (Contributor Author) commented Aug 2, 2023

Hi @Narsil, could you review this PR? It touches a lot of different files, though most changes are only about passing parameters and results through every communication phase. The token selection logic is contained in tokens.py#batch_top_tokens. Additionally, I added top_n_tokens support separately to CausalLM, FlashCausalLM, and Seq2SeqLM.
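For reviewers skimming the thread, the core idea behind a batched top-n selection is roughly the following. This is an invented sketch, not the actual tokens.py#batch_top_tokens implementation or its signature: one topk call sized to the largest request in the batch, then per-row slicing.

```python
import torch

def sketch_batch_top_tokens(top_n: list[int], logprobs: torch.Tensor):
    """top_n: requested n per request; logprobs: [batch, vocab] log-probabilities."""
    max_n = max(top_n)
    if max_n == 0:
        return [[] for _ in top_n]
    # One topk sized to the largest request in the batch...
    values, ids = torch.topk(logprobs, k=max_n, dim=-1)
    # ...then slice each row down to that request's own n.
    return [
        list(zip(ids[i, :n].tolist(), values[i, :n].tolist()))
        for i, n in enumerate(top_n)
    ]

# Example: two requests in one batch asking for 2 and 3 top tokens each.
logprobs = torch.log_softmax(torch.randn(2, 32000), dim=-1)
print(sketch_batch_top_tokens([2, 3], logprobs))
```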

Narsil (Collaborator) commented Aug 3, 2023

Thanks for the PR.

I'm pretty stretched atm and this PR looks like it requires a lot of attention.
I intend to look at it early next week. Don't hesitate to ping if I didn't.

Vinno97 (Contributor Author) commented Aug 3, 2023

Understandable.
In the meantime, we started running it in our testing/experimentation environment, and a colleague of mine is integrating it with a private fork of lm-eval. If we find any issues there, I'll of course update the PR.

Vinno97 (Contributor Author) commented Aug 9, 2023

Just fixed a small oversight in the batch concatenation for flash attention models, which came up during testing.

Vinno97 (Contributor Author) commented Aug 14, 2023

> I intend to look at it early next week. Don't hesitate to ping if I didn't.

@Narsil ping 🙂

Since it's my PR, I obviously can't review it myself, but let me know if I can be of any help.

Narsil (Collaborator) commented Aug 17, 2023

OK, I looked at the code. On the surface, the code looks really nice!

I need to invest a bit of time investigating the performance changes on various models/platforms.
At least making sure there's no huge regression.

Narsil (Collaborator) commented Aug 17, 2023

Are you OK if I do the rebase to be up to date too?

Narsil (Collaborator) commented Aug 17, 2023

OK, here are the results from the bench:

https://github.com/Narsil/tmp_bench/blob/main/README.md

Performance is affected mostly for smaller models without TP.
It kind of makes sense, since the less compute the model itself needs, the larger the relative share of compute spent on logits processing. I'm still surprised by the 25%.

At least there doesn't seem to be any degradation when top_n_tokens is not used, so it should be good for most users (and users can deactivate the option).

Overall I find everything proposed in the PR acceptable.
Defaulting to top_k=5 tokens seems reasonable (the bench used 10).

Narsil (Collaborator) commented Aug 17, 2023

Here is my rebased version: #868

Narsil added a commit that referenced this pull request Aug 28, 2023

Co-authored-by: Vincent Brouwers <[email protected]>

Vinno97 (Contributor Author) commented Aug 28, 2023

Closing, since @Narsil's rebased version was merged via #868.
