Very slow for inputs like "a" * 100000 #195

vvolhejn · 2023-09-18T07:57:49Z

I've noticed that Tiktoken is really slow for strings of repeated characters like "a" * 100_000. Interestingly, when you add spaces, like "a " * 50_000, the performance is orders of magnitude better:

Is this a bug or a fundamental property of BPE?

My Tiktoken version is 0.5.0.

The text was updated successfully, but these errors were encountered:

hauntsaninja · 2023-09-18T08:52:08Z

It's a little bit of both. The first step we do in tokenising is regex splitting, where we split on things like whitespace. The ensuing chunks are then fed to BPE. When tokenising natural language, these chunks usually end up very small.

BPE is superlinear in chunk length, so what you're seeing is approximately expected. But the implementation in tiktoken is optimised for short chunk lengths where constant factors dominate, so time complexity as chunk length scales is worse than what it could be. Thus far, this hasn't mattered in practice, so I haven't bothered branching the logic and adding a bunch of complexity for "aaaaaaaaaaaa...". Do you have a use case, or are you just curious? :-)

vvolhejn · 2023-09-19T08:47:57Z

Thanks! The use case is that we're building an API that accepts text produced/consumed by GPT and we limit the length of the input to a certain number of tokens. Checking if the text is within the limit requires tokenizing it, and if tokenizing certain adversarial long strings takes a long time, this would be a way to DoS the server.

vvolhejn · 2023-09-19T08:48:43Z

Isn't this also an issue that OpenAI itself has to deal with?

hauntsaninja · 2023-09-19T09:05:16Z

Thanks, that's a reasonable use case. Every token is at least one byte, so for now I'd recommend running a quick length check against your input first. At typical current maximum context lengths, this should provide a reasonable guarantee.

I work on the research side of things, but OpenAI is probably pretty happy for you to pay us for these tokens since CPU time is basically free. We have other mitigations against denial of service as well.

vvolhejn · 2023-09-19T11:34:26Z

Every token is at least one byte, so for now I'd recommend running a quick length check against your input first.

But I guess the maximum token length is quite long, which is what I'd need to make statements like "for inputs longer than X bytes, they definitely have more than Y tokens so I can reject them"

The way we worked around this this in the meanwhile is that for extremely long prompts we require a certain number of spaces + other mitigations against DoS as you mention.

qxcv · 2023-09-26T07:15:13Z

If you want a more precise (but not perfectly exact) way to check, you can do something like this, which parses progressively larger portions of the input. This is much more precise than bounding the length of the input, and in practice tends to be reasonably fast (tiktoken parsing of small strings is insanely fast, so if your input is only a few thousand code points then the code below should take low single-digit milliseconds).

@dataclass
class TokenCheck:
    verification_passed: bool
    num_tokens: int


def get_number_tokens(message: str, limit: int, processing_size=500) -> TokenCheck:
    for end_index in range(0, len(message), processing_size):
        if len(encoding.encode(message[:end_index], disallowed_special=())) > limit:
            return TokenCheck(verification_passed=False, num_tokens=limit * 2)
    num_tokens = len(encoding.encode(message, disallowed_special=()))
    if num_tokens > limit:
        return TokenCheck(verification_passed=False, num_tokens=limit * 2)
    else:
        return TokenCheck(verification_passed=True, num_tokens=num_tokens)

We're using this check for a game that requires reasonably exact token counts because pushing the limits of the LLM is part of the game.

carlini · 2023-09-27T02:05:09Z

For what it's worth, I just ran into a bug with tiktoken taking crazy amount of time which broke some code I was working on trying to do data analysis. Here's a plot I generated of runtime across token length:

The ultimate problem was I was trying to encode a dataset of output generated by GPT-J with tiktoken, and the model just fell into one of its loops where it just wrote out a bunch of whitespace over and over. And this caused nearly 100,000 whitespace tokens to be tokenized back-to-back which just stalled my tiktoken encoding forever.

I think it would probably be a good idea to implement a fix for this. While tiktoken was probably designed for encoding GPT-4 output, it's not the only thing people use it for. And having superlinear runtimes can cause other people a lot of pain in ways they wouldn't expect.

vvolhejn · 2023-09-28T11:09:52Z

@qxcv Thanks! Looks like a good approximate workaround. Btw somebody sent Tensor Trust to me a few weeks ago and I really like the idea, although I haven't had time to play around with it yet. (We made https://gandalf.lakera.ai/, so we're working in the same space 🙂)

paplorinc · 2023-12-21T11:18:21Z

This can be solved by a different rank merge algorithm - I've fixed it in https://github.com/knuddelsgmbh/jtokkit (new version not released yet), might also contribute it back to tiktoken if it's an important usecase.

The blue is the current array based algo (doing linear minimum search) and the red is the new AVL tree based:

See: knuddelsgmbh/jtokkit#76

paplorinc · 2024-01-15T21:42:06Z

I pushed a PR here as well to tackle this exact scenario, see: #239

hauntsaninja closed this as completed Sep 18, 2023

hauntsaninja reopened this Sep 19, 2023

paplorinc mentioned this issue Jan 15, 2024

Optimize byte pair merge for really big tokens (40x faster for a 2500 token word) #239

Open

mikolalysenko mentioned this issue Apr 20, 2024

Replace js-tiktoken BPE merge algorithm with faster heap based algorithm dqbd/tiktoken#101

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very slow for inputs like "a" * 100000 #195

Very slow for inputs like "a" * 100000 #195

vvolhejn commented Sep 18, 2023

hauntsaninja commented Sep 18, 2023 •

edited

Loading

vvolhejn commented Sep 19, 2023

vvolhejn commented Sep 19, 2023

hauntsaninja commented Sep 19, 2023 •

edited

Loading

vvolhejn commented Sep 19, 2023

qxcv commented Sep 26, 2023 •

edited

Loading

carlini commented Sep 27, 2023

vvolhejn commented Sep 28, 2023

paplorinc commented Dec 21, 2023 •

edited

Loading

paplorinc commented Jan 15, 2024

Very slow for inputs like "a" * 100000 #195

Very slow for inputs like "a" * 100000 #195

Comments

vvolhejn commented Sep 18, 2023

hauntsaninja commented Sep 18, 2023 • edited Loading

vvolhejn commented Sep 19, 2023

vvolhejn commented Sep 19, 2023

hauntsaninja commented Sep 19, 2023 • edited Loading

vvolhejn commented Sep 19, 2023

qxcv commented Sep 26, 2023 • edited Loading

carlini commented Sep 27, 2023

vvolhejn commented Sep 28, 2023

paplorinc commented Dec 21, 2023 • edited Loading

paplorinc commented Jan 15, 2024

hauntsaninja commented Sep 18, 2023 •

edited

Loading

hauntsaninja commented Sep 19, 2023 •

edited

Loading

qxcv commented Sep 26, 2023 •

edited

Loading

paplorinc commented Dec 21, 2023 •

edited

Loading