
Speed up + streamline prompt template rendering runtime #1286

Open
haileyschoelkopf opened this issue Jan 15, 2024 · 2 comments
Labels
feature request (A feature that isn't implemented yet.) · help wanted (Contributors and extra help welcome.)

Comments

@haileyschoelkopf
Contributor

In some situations (for example, N-shot with very large N) we currently spend an annoying amount of time redundantly rendering our Jinja2 prompt template strings.

There are two ways we can significantly cut down on this runtime without meaningfully increasing code complexity (e.g., no multiprocessing when generating requests / Instances):

  1. At the start, render doc_to_target, doc_to_text, doc_to_choice once each for every doc, and cache that result so that we don't need to repeatedly render our prompts when we want to call these methods multiple times for every test instance.
  2. Cache these attributes, or the actual Instance objects created when a task is run once, so that we can skip this whole processing step in the future. The easiest way to do so would be to leverage HF datasets' dataset.map() function, which can cache results based on pickling a function. However, we'd want to be extremely cautious about whether this might create weird bugs when someone tries to change and rerun a task but the cached intermediate input objects get silently reused, preventing them from seeing the expected changes in behavior when they change code.

Change 1 should suffice for some pretty nice QoL improvements.
We should, however, also make sure this doesn't negatively affect runtime when we have to perform it on a large training set that we're drawing few-shot examples from.
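As a rough illustration of change 1 (the class and method names below are illustrative, not the harness's actual API): compile each Jinja2 template once, and cache the rendered string per doc so repeated calls for the same doc are free.

```python
from jinja2 import Template


class CachedPromptRenderer:
    """Illustrative sketch: render each prompt field once per doc and reuse it."""

    def __init__(self, doc_to_text_src: str, doc_to_target_src: str):
        # Compile the templates once instead of re-parsing them on every call.
        self._templates = {
            "text": Template(doc_to_text_src),
            "target": Template(doc_to_target_src),
        }
        # (field, doc_id) -> rendered string
        self._cache: dict = {}

    def prerender(self, docs) -> None:
        # Render everything up front so few-shot construction never re-renders.
        for doc_id, doc in enumerate(docs):
            for field in self._templates:
                self.render(field, doc, doc_id)

    def render(self, field: str, doc: dict, doc_id: int) -> str:
        key = (field, doc_id)
        if key not in self._cache:
            self._cache[key] = self._templates[field].render(**doc)
        return self._cache[key]
```

The `prerender` pass over a large training split is exactly the case flagged above: if only a handful of docs ever get sampled as few-shot examples, rendering lazily (as `render` does on a cache miss) may be the better default.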

@haileyschoelkopf
Contributor Author

I'll probably take on change 1 sooner rather than later, but if no PR referencing this issue has appeared yet, anyone interested can comment here to volunteer to add it too!

@lhoestq
Contributor

lhoestq commented Jan 29, 2024

Hi! Happy to help if you have any questions regarding datasets hashing for map() functions. :)

FYI, it dumps the function using dill (an alternative to pickle that can handle e.g. lambda functions) with recursive=True so that the globals used in the function are also taken into account. The hash of the dill dump is used to build the fingerprint of the resulting dataset and store it in the cache; see https://huggingface.co/docs/datasets/v2.16.1/en/about_cache#fingerprint

You can override this mechanism by passing your own new_fingerprint= to map(), but in that case you need to make sure the fingerprint changes whenever a processing parameter changes.
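For concreteness, a minimal sketch of that override (Hasher and the new_fingerprint= argument come from datasets; the params dict and build_inputs function are made up for the example):

```python
from datasets import Dataset
from datasets.fingerprint import Hasher

ds = Dataset.from_dict({"question": ["q1", "q2"], "answer": ["a1", "a2"]})

# Parameters that affect how inputs are built; any change here must change
# the fingerprint, otherwise a stale cache would be silently reused.
params = {"num_fewshot": 5, "template_version": "2024-01-15"}

def build_inputs(example):
    # Stand-in for the real prompt-rendering step.
    example["prompt"] = f"Q: {example['question']}\nA:"
    return example

# Hash the input dataset's fingerprint together with the parameters so the
# cache key tracks both the data and the processing configuration.
# (ds._fingerprint is an internal attribute of datasets.Dataset.)
new_fp = Hasher.hash((ds._fingerprint, params))

processed = ds.map(build_inputs, new_fingerprint=new_fp)
```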
