
Tokenizer merging overhaul #334

Merged: 12 commits merged into main from tokenizer-again, Jul 15, 2024
Conversation

@cg123 (Collaborator) commented May 31, 2024

Rewrite the tokenizer merging logic to support all merge methods and allow more customization of behavior.

The previous implementation of tokenizer merging always used either linear or slerp to combine the embedding/LM head parameters. This was to avoid the complexity that would be required to make all merge methods support tensors that potentially have invalid or masked-out values. It worked okay in some cases but wasn't a general solution.

In this implementation, instead of overriding the merge method for embed/lm_head, a preprocessing step remaps them to the vocabulary used by the output model. These (now appropriately sized and ordered) tensors are then merged normally.
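
Conceptually, the remapping step works like the sketch below (my own illustration in plain NumPy, not the mergekit code; `remap_embeddings` and its arguments are hypothetical names). Each model's embedding/LM head matrix is gathered into the output model's vocabulary order, and rows with no counterpart in the model's original tokenizer are flagged so they can be filled by the heuristics described next:

import numpy as np

def remap_embeddings(embed, model_vocab, out_vocab):
    # embed:       (model_vocab_size x d) embedding or LM head matrix
    # model_vocab: token -> row index in the model's original tokenizer
    # out_vocab:   token -> row index in the output model's tokenizer
    d = embed.shape[1]
    remapped = np.zeros((len(out_vocab), d), dtype=embed.dtype)
    missing = np.ones(len(out_vocab), dtype=bool)
    for token, out_idx in out_vocab.items():
        model_idx = model_vocab.get(token)
        if model_idx is not None:
            remapped[out_idx] = embed[model_idx]
            missing[out_idx] = False
    return remapped, missing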

The selection of embedding values for tokens not normally present in a model is where things get slightly tricky. By default, a set of heuristics that I think are sane is applied (a rough sketch in code follows the list below). For a given token and model, if the token is not present in the model's original tokenizer:

  • If the base model has this token present, the base model's embedding is used
  • If only one model in the merge has the token, that model's embedding is used
  • Otherwise, the average of the token's embeddings across all models that have it is used as the default value

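As a rough sketch of that default selection (my own illustration, not the actual mergekit implementation; the function and argument names here are made up), with `embeds` holding the remapped matrices from the preprocessing step above:

import numpy as np

def default_embedding(token_idx, embeds, has_token, base_model=None):
    # embeds:    model name -> remapped embedding matrix (out_vocab_size x d)
    # has_token: model name -> True if the model's original tokenizer had the token
    holders = [name for name, present in has_token.items() if present]
    if base_model is not None and has_token.get(base_model):
        # The base model knows the token: use its embedding.
        return embeds[base_model][token_idx]
    if len(holders) == 1:
        # Exactly one model has the token: use it as the donor.
        return embeds[holders[0]][token_idx]
    if not holders:
        # No model has the token at all; the PR logs a warning and falls
        # back to a zero embedding in this case.
        any_matrix = next(iter(embeds.values()))
        return np.zeros_like(any_matrix[token_idx])
    # Otherwise, average over all models that have the token.
    return np.mean([embeds[name][token_idx] for name in holders], axis=0)
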
This can also be overridden on a per-token level. For example:

merge_method: dare_ties
base_model: ...
models:
  - model: some_chatml_model
  - model: some_weird_model
  - model: some_model
tokenizer:
  source: union
  tokens:
    # if model doesn't have <|im_start|>, use embedding from some_chatml_model
    <|im_start|>:
      source: some_chatml_model
    # use embedding of <|special|> from some_weird_model for *all* models
    <|special|>:
      source: some_weird_model
      force: true
    # output tokenizer will have <|renamed_token|> with embedding of <|original_token|>
    # from some_model
    <|renamed_token|>:
      source:
        kind: model_token
        model: some_model
        token: <|original_token|>
      force: true

A practical example is merging two Llama 3 models, one using the Llama 3 Instruct prompt format and one using ChatML, while trying to preserve the ability to use both formats:

tokenizer:
  source: union
  tokens:
    <|im_start|>:
      source: chatml_model
    <|im_end|>:
      source: chatml_model
    <|start_header_id|>:
      source: llama3_model
      force: true
    <|end_header_id|>:
      source: llama3_model
      force: true
    <|eot_id|>:
      source: llama3_model
      force: true

trust_remote_code=options.trust_remote_code,
add_tokens=tuple(token_cfg.keys()),

Contributor:
Is this where tokens are added from a prompt template that might have extras? Is the tricky part making sure these don't shift the tokenizer for other "cross-known" tokens?

has_token = [p[token_id] >= 0 for p in permutation_list]
num_present = sum(int(x) for x in has_token)
if num_present == 1:
    donor_model = models[has_token.index(True)]

Contributor:

🏥

Contributor:

So you get donated a token that may not have been in your understanding?


if num_present == 0:
    token_configs[token] = TokenEmbeddingConfig(source=ZeroEmbedding())
    logging.warning(f"Token {repr(token)} not found in any model")

Contributor:

In what world might one encounter this?

@Jacobsolawetz (Contributor):
Default behavior for embedding discovery seems reasonable as stated!

@thomasgauthier (Contributor):
This seems good to me!

As a bonus for other reviewers, here's an algorithm description I asked GPT-4o to write based on the PR:


Embedding Merging Algorithm

Given:

  • A set of models $\{M_1, M_2, \ldots, M_n\}$.
  • Tokenizers for each model producing vocabularies $V_1, V_2, \ldots, V_n$.
  • Embedding matrices $E_1, E_2, \ldots, E_n$ corresponding to these models.
  • A unified tokenizer producing a vocabulary $V$ of size $|V|$.

The objective is to produce a unified embedding matrix $E$ for the unified vocabulary $V$.

Tokenizer Merging Algorithm

  1. Initialization:

    • Let $V$ be the unified vocabulary with size $|V|$.
    • Initialize $E$ as a zero matrix of size $|V| \times d$, where $d$ is the embedding dimension.
  2. Embedding Assignment:

    • For each token $t \in V$, determine its source model(s) and embedding(s):
      • If $t$ is present in only one model, use its embedding directly.
      • If $t$ is present in multiple models, average the embeddings.
      • If $t$ is specified with a particular source configuration, use that embedding.
  3. Embedding Computation:

    • For each token $t \in V$:
      1. Let $I_t$ be the set of indices where $t$ appears in the vocabularies $V_1, V_2, \ldots, V_n$.
      2. Compute the embedding for $t$ as follows:
        $$E_t = \frac{1}{|I_t|} \sum_{i \in I_t} E_i[t]$$
        where $E_i[t]$ is the embedding of $t$ in model $i$.
  4. Force Embedding Assignment:

    • If a token $t$ has a forced source specified, override the computed embedding with the specified one.
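
A direct transcription of that description into Python might look like the following (my own rendering with plain NumPy and dicts; the function and argument names are made up for illustration, and forced sources are simplified to a token -> (model index, source token) mapping):

import numpy as np

def merge_embeddings(unified_vocab, model_vocabs, model_embeds, forced=None):
    # unified_vocab: token -> row index in the output matrix
    # model_vocabs:  list of dicts (token -> row index), one per model
    # model_embeds:  list of (vocab_size_i x d) arrays, one per model
    # forced:        optional dict token -> (model_index, source_token)
    d = model_embeds[0].shape[1]
    E = np.zeros((len(unified_vocab), d), dtype=model_embeds[0].dtype)
    for token, row in unified_vocab.items():
        sources = [
            emb[vocab[token]]
            for vocab, emb in zip(model_vocabs, model_embeds)
            if token in vocab
        ]
        if sources:
            # A single source is used directly; multiple sources are averaged.
            E[row] = np.mean(sources, axis=0)
    # Forced assignments override whatever was computed above.
    if forced:
        for token, (model_idx, src_token) in forced.items():
            src_vocab = model_vocabs[model_idx]
            E[unified_vocab[token]] = model_embeds[model_idx][src_vocab[src_token]]
    return E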

@cg123 merged commit 4c3532c into main on Jul 15, 2024 (4 of 6 checks passed)
@cg123 deleted the tokenizer-again branch on July 15, 2024 at 19:59

@Jacobsolawetz (Contributor):
@cg123 I forgot you had this one opened a while ago!
