
handle tokenization more correctly, clean up CI #62

Merged
merged 9 commits into from
May 30, 2024

Conversation

charlesfrye
Contributor

No description provided.

@charlesfrye
Contributor Author

This PR improves the CI job (and SQL QA training in general) in several ways, with tokenization being the most important. Note that tokenization is still not guaranteed to be identical between axolotl and vLLM, which merits further investigation.

Summary of changes:

  • All of the SQL QA configs now have custom tokens (e.g. [SQL]) added to improve the tokenization of inputs.
  • All models are switched over to the default, AutoTokenizer, which defers tokenizer lookup to Hugging Face. This substantially improved the quality of results, though I didn't root-cause why.
  • The prep for the "memorization test" in CI is adjusted to speed up runs.
  • llama-2.yml is removed, since we now have llama-3.yml.
  • Extra config keys are removed from some of the configs.
  • The default GPU is switched from H100s to A100s, reducing time spent queuing in exchange for slightly slower training.
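The first two bullets can be sketched as an axolotl config fragment. This is an illustration rather than the exact diff: the `tokenizer_type` and `tokens` keys follow axolotl's config schema, and the `[SQL]` token comes from the description above, but the surrounding values are assumptions.

```yaml
# Sketch of the relevant axolotl config keys (illustrative, not the PR's exact diff).

# Defer tokenizer lookup to Hugging Face via the default AutoTokenizer.
tokenizer_type: AutoTokenizer

# Custom tokens added so markers like [SQL] tokenize as single units
# instead of being split into fragments.
tokens:
  - "[SQL]"
```

Adding the marker as a token means the model sees one consistent ID for it during both training and inference, which is the kind of axolotl/vLLM tokenization mismatch the PR is trying to reduce.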

@charlesfrye charlesfrye merged commit f64c8d7 into main May 30, 2024
4 checks passed