
Speed up + streamline prompt template rendering runtime #1286

Open
haileyschoelkopf opened this issue Jan 15, 2024 · 2 comments
Labels
feature request (A feature that isn't implemented yet.) · help wanted (Contributors and extra help welcome.)

Comments

@haileyschoelkopf
Contributor

In some situations (for example, N-shot with very large N) we currently spend an annoying amount of time redundantly rendering our Jinja2 prompt template strings.

There are two ways we can significantly cut down on this runtime without meaningfully increasing code complexity (e.g., no multiprocessing when generating requests / Instances):

  1. At the start, render doc_to_target, doc_to_text, doc_to_choice once each for every doc, and cache that result so that we don't need to repeatedly render our prompts when we want to call these methods multiple times for every test instance.
  2. Cache these attributes, or the actual Instance objects created when a task is run once, so that we can skip this whole processing step in the future. The easiest way to do so would be to leverage HF datasets' dataset.map() function, which can cache results based on pickling a function. However, we'd want to be extremely cautious about whether this might create weird bugs when someone tries to change and rerun a task but the cached intermediate input objects get silently reused, preventing them from seeing the expected changes in behavior when they change code.

Change 1 should suffice for some pretty nice QoL improvements.
We should, however, also make sure this doesn't negatively affect runtime when we have to perform it on a large training set that we're drawing few-shot examples from.
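As a rough illustration of change 1 (the class and method names below are illustrative, not the harness's actual API): compile each Jinja2 template once, and cache the rendered string per doc so repeated calls for the same doc are free.

```python
from jinja2 import Template


class CachedPromptRenderer:
    """Illustrative sketch: render each prompt field once per doc and reuse it."""

    def __init__(self, doc_to_text_src: str, doc_to_target_src: str):
        # Compile the templates once instead of re-parsing them on every call.
        self._templates = {
            "text": Template(doc_to_text_src),
            "target": Template(doc_to_target_src),
        }
        # (field, doc_id) -> rendered string
        self._cache: dict = {}

    def prerender(self, docs) -> None:
        # Render everything up front so few-shot construction never re-renders.
        for doc_id, doc in enumerate(docs):
            for field in self._templates:
                self.render(field, doc, doc_id)

    def render(self, field: str, doc: dict, doc_id: int) -> str:
        key = (field, doc_id)
        if key not in self._cache:
            self._cache[key] = self._templates[field].render(**doc)
        return self._cache[key]
```

The `prerender` pass over a large training split is exactly the case flagged above: if only a handful of docs ever get sampled as few-shot examples, rendering lazily (as `render` does on a cache miss) may be the better default.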

@haileyschoelkopf
Contributor Author

I'll probably take on change 1 sooner rather than later, but if no PR referencing this issue has appeared yet, anyone interested can comment here to volunteer to add it too!

@lhoestq
Contributor

lhoestq commented Jan 29, 2024

Hi! Happy to help if you have any questions regarding datasets hashing for map() functions. :)

FYI, it dumps the function using dill (an alternative to pickle that can handle e.g. lambda functions) with recursive=True so that the globals used in the function are also taken into account. The hash of the dill dump is used to build the fingerprint of the resulting dataset and store it in the cache; see https://huggingface.co/docs/datasets/v2.16.1/en/about_cache#fingerprint

You can override this mechanism by passing your own new_fingerprint= to map(), but in that case you need to make sure the fingerprint changes whenever a processing parameter changes.
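For concreteness, a minimal sketch of that override (Hasher and the new_fingerprint= argument come from datasets; the params dict and build_inputs function are made up for the example):

```python
from datasets import Dataset
from datasets.fingerprint import Hasher

ds = Dataset.from_dict({"question": ["q1", "q2"], "answer": ["a1", "a2"]})

# Parameters that affect how inputs are built; any change here must change
# the fingerprint, otherwise a stale cache would be silently reused.
params = {"num_fewshot": 5, "template_version": "2024-01-15"}

def build_inputs(example):
    # Stand-in for the real prompt-rendering step.
    example["prompt"] = f"Q: {example['question']}\nA:"
    return example

# Hash the input dataset's fingerprint together with the parameters so the
# cache key tracks both the data and the processing configuration.
# (ds._fingerprint is an internal attribute of datasets.Dataset.)
new_fp = Hasher.hash((ds._fingerprint, params))

processed = ds.map(build_inputs, new_fingerprint=new_fp)
```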
