
Validate special token ids are in range when loading GGUF model #3635

Merged

Conversation

@KerfuffleV2 (Collaborator) commented Oct 15, 2023

This pull adds validation for special token ids that are out of range. Invalid ids are ignored and a warning message is displayed when loading the model. This model can reproduce the issue: https://huggingface.co/Undi95/Mistral-11B-OmniMix-bf16-GGUF (see the issue for the HF version with metadata).

I also included a small optimization for llama_byte_to_token in SPM mode - snprintf is overkill there since converting a uint8_t to hex is trivial. There is also slightly friendlier behavior when a BPE model is missing the newline token: assert rather than blindly dereferencing the tokenization result.

Of course, we shouldn't end up with a GGUF file containing invalid token ids in the first place. I looked at convert.py and the GGUF Python side; the problem is that the special vocab handling doesn't have access to information like the model's vocab size, so it can't validate the token ids.

I updated SpecialVocab to take an optional n_vocab argument so it can validate that the ids are in range (the in-repo conversion scripts were also updated to pass this), and refactored to make the logic a bit clearer.
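
For illustration, here is a minimal sketch of the kind of range check described above on the gguf-py side. It is a sketch under assumptions, not the actual SpecialVocab code: the class name, helper name, and warning text below are made up.

```python
# Sketch only: ignore out-of-range special token ids when n_vocab is known.
# Names and message wording are assumptions, not the real gguf-py code.
from typing import Dict, Optional

class SpecialVocabSketch:
    def __init__(self, n_vocab: Optional[int] = None) -> None:
        self.n_vocab = n_vocab
        self.special_token_ids: Dict[str, int] = {}

    def set_special_token(self, typ: str, tid: int) -> None:
        # Reject obviously bad ids, and anything outside the vocab when the
        # conversion script supplied the vocab size.
        if not isinstance(tid, int) or tid < 0:
            return
        if self.n_vocab is not None and tid >= self.n_vocab:
            print(f'gguf: WARNING: {typ} token id {tid} is out of range (n_vocab = {self.n_vocab}), skipping')
            return
        self.special_token_ids[typ] = tid
```

With n_vocab=32000, an EOS id of 32000 would be skipped with a warning rather than written into the GGUF metadata.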

Closes #3634

@KerfuffleV2 marked this pull request as draft on October 15, 2023, 14:59
@staviq (Collaborator) commented Oct 15, 2023

Is the test workflow borked?

main : failed test:    ' ('
main : detokenized to: ' (' instead of ' ('
main : expected tokens:  29871,    313, 
main : got tokens:       29871,    313, 

@KerfuffleV2 (Collaborator, Author) commented Oct 15, 2023

@staviq I was thinking that, but it was actually me that's borked. :(

Happens to everyone, especially if you find something exciting and care about it :)

@KerfuffleV2 changed the title from "Valid special token ids are in range when loading GGUF model" to "Validate special token ids are in range when loading GGUF model" on Oct 15, 2023
@KerfuffleV2 marked this pull request as ready for review on October 15, 2023, 15:57
@KerfuffleV2 (Collaborator, Author) commented:
I've been looking at the Python side and I actually can't replicate how the special EOS token got set to 32000. (It is fixed in the HF repo now, but even reverting to the previous version where it was set to 32000 doesn't result in creating a GGUF file with EOS set to 32000.)

Possibly it was converted using an older version. Also, I can't even get it to convert with the existing added_tokens.json, which looks like:

{
  "</s>": 2,
  "<s>": 1,
  "<unk>": 0
}

I don't know if that's wrong, but the convert script expects added tokens to be... well, added. This is redefining tokens that are already in the main vocab, but convert expects the added tokens to come with ids past the normal vocab.

I think the error message in convert.py is wrong though; it should use vocab_size rather than len(added_tokens), like:

        if expected_ids != actual_ids:
            raise Exception(f"Expected added token IDs to be sequential and start at {vocab_size}; got {actual_ids}")

which would give:

    Exception: Expected added token IDs to be sequential and start at 32000; got [0, 1, 2]

instead of the current:

    Exception: Expected added token IDs to be sequential and start at 2; got [0, 1, 2]

(Unfortunately I think I'm responsible for that mistake.)
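
For context, a sketch of how that check plausibly fits together; the lines computing expected_ids and actual_ids are assumed from the variable names in the message, not quoted from convert.py, and the example values stand in for what the script would have at that point:

```python
# Assumed context for the check discussed above: added tokens are expected to
# occupy the ids immediately after the base vocab.
vocab_size = 32000
added_tokens = {"</s>": 2, "<s>": 1, "<unk>": 0}  # the added_tokens.json above

expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
actual_ids = sorted(added_tokens.values())
if expected_ids != actual_ids:
    # With the fix, the message reports vocab_size (32000 here), not len(added_tokens).
    raise Exception(f"Expected added token IDs to be sequential and start at {vocab_size}; got {actual_ids}")
```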

Anyway, I'm not sure how whoever converted that model even managed to convert it to GGUF; maybe they hacked convert.py to bypass those checks.

@TheBloke (Contributor) commented:
Yeah that error message confused me at first :)

I believe the issue with added_tokens < vocab_size is fixed by another PR though, so hopefully no need to look at that? #3585

I pointed out the issue in the base repo to Undi, so he fixed it. As to how it happened in the first place - he may have had different config files at the time he made his GGUFs, or yes, it's entirely possible he hacked convert.py! I talked to him about it the other day and he couldn't remember how he made them or what config files he had at the time. And he never noticed the llama.cpp problem because he tested with SillyTavern, which must not have the same bug.

I did GGUFs for that model at the weekend, with the corrected config files (EOS should be 2, ie </s>, not 32000), so working files are available now. Even more confusingly, Undi's GGUFs were using the right EOS, ie 2, so he definitely didn't make them from the config files that were in the repo at the time I first made the models, which listed 32000 as EOS.

It's always an adventure with Undi models 🤣

@KerfuffleV2 (Collaborator, Author) commented:
> I believe the issue with added_tokens < vocab_size is fixed by another PR though, so hopefully no need to look at that?

Ah, thanks.

> I did GGUFs for that model at the weekend, with the corrected config files

The thing that confused me is setting

  "eos_token_id": 32000,

in config.json actually doesn't result in the GGUF file having 32000 as the EOS token id. I can only replicate it if I edit tokenizer.json to look like:

{
  "added_tokens": [
    {
      "id": 32000,
      "content": "</s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
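
A rough way to see why that edit matters: as far as I can tell from the behavior above, the EOS id ends up being resolved from the added_tokens entry by its content string, not from config.json's eos_token_id. A purely illustrative check (the helper below is hypothetical, not part of convert.py or gguf-py):

```python
import json
from typing import Optional

def eos_id_from_added_tokens(tokenizer_json_path: str) -> Optional[int]:
    # Hypothetical helper: look up the id recorded for "</s>" in
    # tokenizer.json's added_tokens. With the edit shown above this returns
    # 32000, which is what ends up as the GGUF EOS id.
    with open(tokenizer_json_path, encoding='utf-8') as fp:
        tokenizer = json.load(fp)
    for entry in tokenizer.get('added_tokens', []):
        if entry.get('content') == '</s>':
            return entry['id']
    return None
```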

> It's always an adventure with Undi models

Haha, at least it's a really good model for being so small. Looking at the output, I can actually imagine it's coming from one of the weaker 70Bs, which really hasn't been the case in the past even with 30Bs.

@ggerganov (Owner) left a comment

Needs coordination with #3585 and we can merge

@KerfuffleV2 (Collaborator, Author) commented:
> Needs coordination with #3585 and we can merge

I don't think there should be a conflict; they're doing different things. #3585 only touches convert.py, while my Python changes are mainly in gguf-py (the part that touches convert.py is really just the SpecialVocab call).

Pretty sure it wouldn't even cause a merge conflict.

@KerfuffleV2 force-pushed the fix-handle-bad-special-token-ids branch from 9d14041 to 76b05fc on October 17, 2023, 19:04
@KerfuffleV2 (Collaborator, Author) commented:
Is this actually approved or am I missing something about the #3585 coordination issue?

@ggerganov merged commit a5e7dbd into ggerganov:master on Oct 22, 2023
32 checks passed
@KerfuffleV2 deleted the fix-handle-bad-special-token-ids branch on November 17, 2023, 03:11

Successfully merging this pull request may close these issues.

Special token ids aren't validated when loading a model