Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow for inputs like "a" * 100000 #195

Open
vvolhejn opened this issue Sep 18, 2023 · 10 comments
Open

Very slow for inputs like "a" * 100000 #195

vvolhejn opened this issue Sep 18, 2023 · 10 comments

Comments

@vvolhejn
Copy link

I've noticed that Tiktoken is really slow for strings of repeated characters like "a" * 100_000. Interestingly, when you add spaces, like "a " * 50_000, the performance is orders of magnitude better:

Screenshot 2023-09-18 at 09 56 21

Is this a bug or a fundamental property of BPE?

My Tiktoken version is 0.5.0.

@hauntsaninja
Copy link
Collaborator

hauntsaninja commented Sep 18, 2023

It's a little bit of both. The first step we do in tokenising is regex splitting, where we split on things like whitespace. The ensuing chunks are then fed to BPE. When tokenising natural language, these chunks usually end up very small.

BPE is superlinear in chunk length, so what you're seeing is approximately expected. But the implementation in tiktoken is optimised for short chunk lengths where constant factors dominate, so time complexity as chunk length scales is worse than what it could be. Thus far, this hasn't mattered in practice, so I haven't bothered branching the logic and adding a bunch of complexity for "aaaaaaaaaaaa...". Do you have a use case, or are you just curious? :-)

@vvolhejn
Copy link
Author

Thanks! The use case is that we're building an API that accepts text produced/consumed by GPT and we limit the length of the input to a certain number of tokens. Checking if the text is within the limit requires tokenizing it, and if tokenizing certain adversarial long strings takes a long time, this would be a way to DoS the server.

@vvolhejn
Copy link
Author

Isn't this also an issue that OpenAI itself has to deal with?

@hauntsaninja
Copy link
Collaborator

hauntsaninja commented Sep 19, 2023

Thanks, that's a reasonable use case. Every token is at least one byte, so for now I'd recommend running a quick length check against your input first. At typical current maximum context lengths, this should provide a reasonable guarantee.

I work on the research side of things, but OpenAI is probably pretty happy for you to pay us for these tokens since CPU time is basically free. We have other mitigations against denial of service as well.

@hauntsaninja hauntsaninja reopened this Sep 19, 2023
@vvolhejn
Copy link
Author

Every token is at least one byte, so for now I'd recommend running a quick length check against your input first.

But I guess the maximum token length is quite long, which is what I'd need to make statements like "for inputs longer than X bytes, they definitely have more than Y tokens so I can reject them"

The way we worked around this this in the meanwhile is that for extremely long prompts we require a certain number of spaces + other mitigations against DoS as you mention.

@qxcv
Copy link

qxcv commented Sep 26, 2023

If you want a more precise (but not perfectly exact) way to check, you can do something like this, which parses progressively larger portions of the input. This is much more precise than bounding the length of the input, and in practice tends to be reasonably fast (tiktoken parsing of small strings is insanely fast, so if your input is only a few thousand code points then the code below should take low single-digit milliseconds).

@dataclass
class TokenCheck:
    verification_passed: bool
    num_tokens: int


def get_number_tokens(message: str, limit: int, processing_size=500) -> TokenCheck:
    for end_index in range(0, len(message), processing_size):
        if len(encoding.encode(message[:end_index], disallowed_special=())) > limit:
            return TokenCheck(verification_passed=False, num_tokens=limit * 2)
    num_tokens = len(encoding.encode(message, disallowed_special=()))
    if num_tokens > limit:
        return TokenCheck(verification_passed=False, num_tokens=limit * 2)
    else:
        return TokenCheck(verification_passed=True, num_tokens=num_tokens)

We're using this check for a game that requires reasonably exact token counts because pushing the limits of the LLM is part of the game.

@carlini
Copy link

carlini commented Sep 27, 2023

For what it's worth, I just ran into a bug with tiktoken taking crazy amount of time which broke some code I was working on trying to do data analysis. Here's a plot I generated of runtime across token length:

image

The ultimate problem was I was trying to encode a dataset of output generated by GPT-J with tiktoken, and the model just fell into one of its loops where it just wrote out a bunch of whitespace over and over. And this caused nearly 100,000 whitespace tokens to be tokenized back-to-back which just stalled my tiktoken encoding forever.

I think it would probably be a good idea to implement a fix for this. While tiktoken was probably designed for encoding GPT-4 output, it's not the only thing people use it for. And having superlinear runtimes can cause other people a lot of pain in ways they wouldn't expect.

@vvolhejn
Copy link
Author

@qxcv Thanks! Looks like a good approximate workaround. Btw somebody sent Tensor Trust to me a few weeks ago and I really like the idea, although I haven't had time to play around with it yet. (We made https://gandalf.lakera.ai/, so we're working in the same space 🙂)

@paplorinc
Copy link
Contributor

paplorinc commented Dec 21, 2023

This can be solved by a different rank merge algorithm - I've fixed it in https://github.com/knuddelsgmbh/jtokkit (new version not released yet), might also contribute it back to tiktoken if it's an important usecase.

The blue is the current array based algo (doing linear minimum search) and the red is the new AVL tree based:
image

See: knuddelsgmbh/jtokkit#76

@paplorinc
Copy link
Contributor

I pushed a PR here as well to tackle this exact scenario, see: #239

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants