Add WIP support for returning top tokens #617

Closed
wants to merge 12 commits

Conversation

Vinno97 (Contributor) commented Jul 14, 2023

What does this PR do?

This PR adds support for returning the log probabilities of the top N tokens (besides the chosen token).

The token selection code is borrowed from IBM's fork (many thanks @njhill) and I merged it into the current main branch. I chose to name the input parameter top_n_tokens, and it can be any positive integer. If it is set, a top_tokens field in the result details will contain an L×N list of tokens, where L is the sequence length and N is the number of requested tokens.
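For illustration, this is roughly what the feature looks like from the client side. It is a hedged sketch: the top_n_tokens parameter and the top_tokens field in the details are from this PR, while the endpoint path, the `parameters`/`details` request fields, and the per-token `text`/`logprob` keys are assumptions about the server's existing schema.

```python
import requests

# Hypothetical request against a running text-generation-inference server.
# `top_n_tokens` is the parameter added by this PR; the other fields are
# assumed to follow the server's existing /generate schema.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "The capital of France is",
        "parameters": {
            "max_new_tokens": 3,
            "details": True,    # ask for per-token details in the response
            "top_n_tokens": 5,  # return the 5 most likely tokens per step
        },
    },
    timeout=60,
).json()

# `details.top_tokens` is an L x N list: one entry per generated position (L),
# each holding the N requested candidate tokens. The `text`/`logprob` keys are
# assumptions about the per-token schema.
for step in response["details"]["top_tokens"]:
    print([(candidate["text"], candidate["logprob"]) for candidate in step])
```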

Note that it is currently very much WIP.

  • It is currently only implemented for seq-to-seq models (though it's modular and should be easy to implement for other models as well).
  • I haven't implemented this functionality for the prefill. Do we want to?
  • I sort the tokens based on their logprobs, because IBM's implementation had this functionality. However, it may cause unnecessary overhead.
  • IBM's code doesn't just return the "top n" tokens: it explicitly returns more tokens when there are candidates with equal probabilities (see the sketch after this list). I am not sure if this is desired.
  • I have not yet implemented input validation of the top_n_tokens field.
  • I have also done no benchmarking to test the impact on performance.
    • I haven't given the benchmarking tool the capability to benchmark this feature.
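To make the tie behaviour mentioned above concrete, here is a minimal sketch (not the PR's or IBM's actual code) of "top n, extended on ties": every token whose logprob equals the n-th best one is kept as well.

```python
import torch

def top_n_with_ties(logprobs: torch.Tensor, n: int):
    """logprobs: 1-D tensor of vocabulary log-probabilities for one step."""
    sorted_lp, sorted_ids = torch.sort(logprobs, descending=True)
    cutoff = sorted_lp[n - 1]
    # Keep every token whose logprob is at least the n-th best, so exact ties
    # with the n-th entry are included as well.
    keep = int((sorted_lp >= cutoff).sum())
    return sorted_ids[:keep].tolist(), sorted_lp[:keep].tolist()

# Example: with n=2 and a tie at the second position, three tokens come back.
probs = torch.tensor([0.5, 0.2, 0.2, 0.1])
print(top_n_with_ties(torch.log(probs), n=2))
```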

Fixes #604

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed.
Specifically @OlivierDehaene and @njhill

spew commented Jul 19, 2023

Could this be used to "shorten" a generation (i.e. only use the first N tokens)? If so, how would one use the logprobs to calculate a threshold for throwing out all or a portion of a generation?

Vinno97 (Contributor Author) commented Jul 24, 2023

This PR only adds the ability to return not just the sampled token, but also the most likely alternative tokens. This is mostly useful when building applications that require the logprobs of multiple tokens, such as ranking and classification.

What you are referring to sounds like a custom stopping criterion.
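For what spew describes, the logprobs returned in the details could be post-processed on the client side. A hypothetical sketch, assuming per-token entries with `text` and `logprob` keys; the helper name and threshold are made up for illustration:

```python
import math

def truncate_on_confidence(tokens: list[dict], min_prob: float = 0.05) -> str:
    """tokens: per-step dicts with at least `text` and `logprob` keys."""
    kept = []
    for tok in tokens:
        if math.exp(tok["logprob"]) < min_prob:
            break  # stop at the first low-confidence token
        kept.append(tok["text"])
    return "".join(kept)

# Toy example with made-up logprobs: the final token falls below the threshold.
print(truncate_on_confidence(
    [
        {"text": "Par", "logprob": -0.1},
        {"text": "is", "logprob": -0.2},
        {"text": "!", "logprob": -4.5},
    ],
    min_prob=0.05,
))
```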

OlivierDehaene (Member) commented Jul 25, 2023

This is very cool. Thanks for the work!
Are you ok with me maybe merging this to a dev branch to add my modifications directly before merging it to the main branch?
Cheers!

Vinno97 (Contributor Author) commented Jul 26, 2023

It currently only works (properly) for FlashCausalLM models. Perhaps it'd be good to wait until I implement it everywhere.

Vinno97 (Contributor Author) commented Jul 28, 2023

@OlivierDehaene I think it's mergeable now!

I haven't edited the Python client yet, but all server functionality should work now. For some reason, however, my latest commit (95d0fba) doesn't show up in this PR yet.

OlivierDehaene (Member) commented

@Vinno97, I will be off in August but @Narsil will be available to review the PR.
For your info, we updated the license of this repo to HFOIL. You can review the reasons behind this license change here.

Vinno97 (Contributor Author) commented Jul 29, 2023

Thank you for notifying me of the licensing change. Luckily, as far as I understand it, this has no effect on our operations for now.

Vinno97 marked this pull request as ready for review on August 2, 2023, 13:08.

Vinno97 (Contributor Author) commented Aug 2, 2023

Hi @Narsil, could you review this PR? It touches a lot of different files, though most changes are only about passing parameters and results through every communication phase. The token selection logic is contained in tokens.py#batch_top_tokens. Additionally, I added top_n_tokens support separately to CausalLM, FlashCausalLM, and Seq2SeqLM.
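For reviewers skimming the thread, the core idea behind a batched top-n selection is roughly the following. This is an invented sketch, not the actual tokens.py#batch_top_tokens implementation or its signature: one topk call sized to the largest request in the batch, then per-row slicing.

```python
import torch

def sketch_batch_top_tokens(top_n: list[int], logprobs: torch.Tensor):
    """top_n: requested n per request; logprobs: [batch, vocab] log-probabilities."""
    max_n = max(top_n)
    if max_n == 0:
        return [[] for _ in top_n]
    # One topk sized to the largest request in the batch...
    values, ids = torch.topk(logprobs, k=max_n, dim=-1)
    # ...then slice each row down to that request's own n.
    return [
        list(zip(ids[i, :n].tolist(), values[i, :n].tolist()))
        for i, n in enumerate(top_n)
    ]

# Example: two requests in one batch asking for 2 and 3 top tokens each.
logprobs = torch.log_softmax(torch.randn(2, 32000), dim=-1)
print(sketch_batch_top_tokens([2, 3], logprobs))
```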

Narsil (Collaborator) commented Aug 3, 2023

Thanks for the PR.

I'm pretty stretched atm and this PR looks like it requires a lot of attention.
I intend to look at it early next week. Don't hesitate to ping if I didn't.

Vinno97 (Contributor Author) commented Aug 3, 2023

Understandable.
In the meantime, we started running it in our testing/experimentation environment, and a colleague of mine is integrating it with a private fork of lm-eval. If we find any issues there, I'll of course update the PR.

Vinno97 (Contributor Author) commented Aug 9, 2023

Just fixed a small oversight in the batch concatenation for flash attention models, which came up during testing.

Vinno97 (Contributor Author) commented Aug 14, 2023

> I intend to look at it early next week. Don't hesitate to ping if I didn't.

@Narsil ping 🙂

Since it's my PR, I obviously can't review it myself, but let me know if I can be of any help.

Narsil (Collaborator) commented Aug 17, 2023

OK, I looked at the code. On the surface, the code looks really nice!

I need to invest a bit of time investigating the performance changes on various models/platforms.
At least making sure there's no huge regression.

Narsil (Collaborator) commented Aug 17, 2023

Are you OK if I do the rebase to be up to date too?

Narsil (Collaborator) commented Aug 17, 2023

OK, here are the results from the bench:

https://github.com/Narsil/tmp_bench/blob/main/README.md

Performance is affected mostly for smaller models without TP.
It kind of makes sense, since the less compute the model itself needs, the larger the relative share of compute spent on logits processing. I'm still surprised by the 25%.

At least there doesn't seem to be any degradation when top_n_tokens is not used, so it should be good for most users (and users can deactivate the option).

Overall I find everything proposed in the PR acceptable.
Defaulting to top_k=5 tokens seems reasonable (the bench used 10).

Narsil (Collaborator) commented Aug 17, 2023

Here is my rebased version: #868

Narsil added a commit that referenced this pull request Aug 28, 2023

Co-authored-by: Vincent Brouwers <[email protected]>

Vinno97 (Contributor Author) commented Aug 28, 2023

Closing, since @Narsil's rebased version was merged via #868.
