
handle tokenization more correctly, clean up CI #62

Merged
merged 9 commits into from
May 30, 2024

Conversation

charlesfrye
Contributor

No description provided.

@charlesfrye
Contributor Author

This PR improves the CI job (and SQL QA training in general) in several ways, with tokenization being the most important. Note that tokenization is still not guaranteed to be identical between axolotl and vLLM, which merits further investigation.

Summary of changes:

  • All of the SQL QA configs now have custom tokens (e.g. [SQL]) added to improve the tokenization of inputs.
  • All models are switched over to the default, AutoTokenizer, which defers tokenizer lookup to Hugging Face. This substantially improved the quality of results, though I didn't root-cause why.
  • The prep for the "memorization test" in CI is adjusted to speed up runs.
  • llama-2.yml is removed, since we now have llama-3.yml.
  • Extra config keys are removed from some of the configs.
  • The default GPU is switched from H100s to A100s, reducing time spent queuing in exchange for slightly slower training.
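The first two bullets can be sketched as an axolotl config fragment. This is an illustration rather than the exact diff: the `tokenizer_type` and `tokens` keys follow axolotl's config schema, and the `[SQL]` token comes from the description above, but the surrounding values are assumptions.

```yaml
# Sketch of the relevant axolotl config keys (illustrative, not the PR's exact diff).

# Defer tokenizer lookup to Hugging Face via the default AutoTokenizer.
tokenizer_type: AutoTokenizer

# Custom tokens added so markers like [SQL] tokenize as single units
# instead of being split into fragments.
tokens:
  - "[SQL]"
```

Adding the marker as a token means the model sees one consistent ID for it during both training and inference, which is the kind of axolotl/vLLM tokenization mismatch the PR is trying to reduce.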

@charlesfrye charlesfrye merged commit f64c8d7 into main May 30, 2024
4 checks passed