
Tokenizer merging overhaul #334

Merged: 12 commits merged into main from tokenizer-again, Jul 15, 2024
Conversation

@cg123 (Collaborator) commented May 31, 2024

Rewrite the tokenizer merging logic to support all merge methods and allow more customization of behavior.

The previous implementation of tokenizer merging always used either linear or slerp to combine the embedding/LM head parameters. This was to avoid the complexity that would be required to make all merge methods support tensors that potentially have invalid or masked-out values. It worked okay in some cases but wasn't a general solution.

In this implementation, instead of overriding the merge method for embed/lm_head, a preprocessing step remaps them to the vocabulary used by the output model. These (now appropriately sized and ordered) tensors are then merged normally.
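
Conceptually, the remapping step works like the sketch below (my own illustration in plain NumPy, not the mergekit code; `remap_embeddings` and its arguments are hypothetical names). Each model's embedding/LM head matrix is gathered into the output model's vocabulary order, and rows with no counterpart in the model's original tokenizer are flagged so they can be filled by the heuristics described next:

import numpy as np

def remap_embeddings(embed, model_vocab, out_vocab):
    # embed:       (model_vocab_size x d) embedding or LM head matrix
    # model_vocab: token -> row index in the model's original tokenizer
    # out_vocab:   token -> row index in the output model's tokenizer
    d = embed.shape[1]
    remapped = np.zeros((len(out_vocab), d), dtype=embed.dtype)
    missing = np.ones(len(out_vocab), dtype=bool)
    for token, out_idx in out_vocab.items():
        model_idx = model_vocab.get(token)
        if model_idx is not None:
            remapped[out_idx] = embed[model_idx]
            missing[out_idx] = False
    return remapped, missing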

The selection of embedding values for tokens not normally present in a model is where things get slightly tricky. By default, a set of heuristics that I think are sane is applied (a rough sketch in code follows the list below). For a given token and model, if the token is not present in the model's original tokenizer:

  • If the base model has this token present, the base model's embedding is used
  • If only one model in the merge has the token, that model's embedding is used
  • Otherwise, the average of the token's embeddings across all models that have it is used as the default value

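As a rough sketch of that default selection (my own illustration, not the actual mergekit implementation; the function and argument names here are made up), with `embeds` holding the remapped matrices from the preprocessing step above:

import numpy as np

def default_embedding(token_idx, embeds, has_token, base_model=None):
    # embeds:    model name -> remapped embedding matrix (out_vocab_size x d)
    # has_token: model name -> True if the model's original tokenizer had the token
    holders = [name for name, present in has_token.items() if present]
    if base_model is not None and has_token.get(base_model):
        # The base model knows the token: use its embedding.
        return embeds[base_model][token_idx]
    if len(holders) == 1:
        # Exactly one model has the token: use it as the donor.
        return embeds[holders[0]][token_idx]
    if not holders:
        # No model has the token at all; the PR logs a warning and falls
        # back to a zero embedding in this case.
        any_matrix = next(iter(embeds.values()))
        return np.zeros_like(any_matrix[token_idx])
    # Otherwise, average over all models that have the token.
    return np.mean([embeds[name][token_idx] for name in holders], axis=0)
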
This can also be overridden on a per-token level. For example:

merge_method: dare_ties
base_model: ...
models:
  - model: some_chatml_model
  - model: some_weird_model
  - model: some_model
tokenizer:
  source: union
  tokens:
    # if model doesn't have <|im_start|>, use embedding from some_chatml_model
    <|im_start|>:
      source: some_chatml_model
    # use embedding of <|special|> from some_weird_model for *all* models
    <|special|>:
      source: some_weird_model
      force: true
    # output tokenizer will have <|renamed_token|> with embedding of <|original_token|>
    # from some_model
    <|renamed_token|>:
      source:
        kind: model_token
        model: some_model
        token: <|original_token|>
      force: true

A practical example is merging two Llama 3 models, one using the Llama 3 Instruct prompt format and one using ChatML, while trying to preserve the ability to use both formats:

tokenizer:
  source: union
  tokens:
    <|im_start|>:
      source: chatml_model
    <|im_end|>:
      source: chatml_model
    <|start_header_id|>:
      source: llama3_model
      force: true
    <|end_header_id|>:
      source: llama3_model
      force: true
    <|eot_id|>:
      source: llama3_model
      force: true

trust_remote_code=options.trust_remote_code,
add_tokens=tuple(token_cfg.keys()),

Contributor:
Is this where tokens are added from a prompt template that might have extras? Is the tricky part making sure these don't shift the tokenizer for other "cross-known" tokens?

has_token = [p[token_id] >= 0 for p in permutation_list]
num_present = sum(int(x) for x in has_token)
if num_present == 1:
    donor_model = models[has_token.index(True)]

Contributor:

🏥

Contributor:

So you get donated a token that may not have been in your understanding?


if num_present == 0:
    token_configs[token] = TokenEmbeddingConfig(source=ZeroEmbedding())
    logging.warning(f"Token {repr(token)} not found in any model")

Contributor:

In what world might one encounter this?

@Jacobsolawetz (Contributor):
Default behavior for embedding discovery seems reasonable as stated!

@thomasgauthier (Contributor):
This seems good to me!

As a bonus for other reviewers, here's an algorithm description I asked GPT-4o to write based on the PR:


Embedding Merging Algorithm

Given:

  • A set of models $\{M_1, M_2, \ldots, M_n\}$.
  • Tokenizers for each model producing vocabularies $V_1, V_2, \ldots, V_n$.
  • Embedding matrices $E_1, E_2, \ldots, E_n$ corresponding to these models.
  • A unified tokenizer producing a vocabulary $V$ of size $|V|$.

The objective is to produce a unified embedding matrix $E$ for the unified vocabulary $V$.

Tokenizer Merging Algorithm

  1. Initialization:

    • Let $V$ be the unified vocabulary with size $|V|$.
    • Initialize $E$ as a zero matrix of size $|V| \times d$, where $d$ is the embedding dimension.
  2. Embedding Assignment:

    • For each token $t \in V$, determine its source model(s) and embedding(s):
      • If $t$ is present in only one model, use its embedding directly.
      • If $t$ is present in multiple models, average the embeddings.
      • If $t$ is specified with a particular source configuration, use that embedding.
  3. Embedding Computation:

    • For each token $t \in V$:
      1. Let $I_t$ be the set of indices where $t$ appears in the vocabularies $V_1, V_2, \ldots, V_n$.
      2. Compute the embedding for $t$ as follows:
        $$E_t = \frac{1}{|I_t|} \sum_{i \in I_t} E_i[t]$$
        where $E_i[t]$ is the embedding of $t$ in model $i$.
  4. Force Embedding Assignment:

    • If a token $t$ has a forced source specified, override the computed embedding with the specified one.
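
A direct transcription of that description into Python might look like the following (my own rendering with plain NumPy and dicts; the function and argument names are made up for illustration, and forced sources are simplified to a token -> (model index, source token) mapping):

import numpy as np

def merge_embeddings(unified_vocab, model_vocabs, model_embeds, forced=None):
    # unified_vocab: token -> row index in the output matrix
    # model_vocabs:  list of dicts (token -> row index), one per model
    # model_embeds:  list of (vocab_size_i x d) arrays, one per model
    # forced:        optional dict token -> (model_index, source_token)
    d = model_embeds[0].shape[1]
    E = np.zeros((len(unified_vocab), d), dtype=model_embeds[0].dtype)
    for token, row in unified_vocab.items():
        sources = [
            emb[vocab[token]]
            for vocab, emb in zip(model_vocabs, model_embeds)
            if token in vocab
        ]
        if sources:
            # A single source is used directly; multiple sources are averaged.
            E[row] = np.mean(sources, axis=0)
    # Forced assignments override whatever was computed above.
    if forced:
        for token, (model_idx, src_token) in forced.items():
            src_vocab = model_vocabs[model_idx]
            E[unified_vocab[token]] = model_embeds[model_idx][src_vocab[src_token]]
    return E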

@cg123 merged commit 4c3532c into main on Jul 15, 2024 (4 of 6 checks passed)
@cg123 deleted the tokenizer-again branch on July 15, 2024 at 19:59

@Jacobsolawetz (Contributor):
@cg123 I forgot you had this one opened a while ago!
