Deal with _encode_pair() / Llama token 29871 / SPIECE_UNDERLINE better #1322

Status: Draft, wants to merge 32 commits into base: main
Changes shown from 1 of 32 commits

Commits:
3824828
first stab at wrap_chat_template
daniel-furman Jan 7, 2024
a784417
first stab at wrap_chat_template, strip error fix
daniel-furman Jan 7, 2024
53c68db
first stab at wrap_chat_template, rfind continuation fix
daniel-furman Jan 7, 2024
3e27f9d
first stab at wrap_chat_template, formatting in function
daniel-furman Jan 7, 2024
87dff8b
first stab at wrap_chat_template, print statements in loglikelihood f…
daniel-furman Jan 7, 2024
5c4d9c7
first stab at wrap_chat_template, remove system for now
daniel-furman Jan 7, 2024
e689727
first stab at wrap_chat_template, remove special chars from continuation
daniel-furman Jan 10, 2024
337c084
first stab at wrap_chat_template, remove special chars tab indenting …
daniel-furman Jan 10, 2024
6c68fd1
Merge branch 'EleutherAI:main' into main
daniel-furman Jan 10, 2024
34b32f7
first stab at wrap_chat_template, various
daniel-furman Jan 10, 2024
59e3b17
first stab at wrap_chat_template, various
daniel-furman Jan 10, 2024
7191904
first stab at wrap_chat_template, arc conversation test
daniel-furman Jan 10, 2024
9949e4f
first stab at wrap_chat_template, arc conversation test
daniel-furman Jan 10, 2024
2d3c835
first stab at wrap_chat_template, remove arc experiment
daniel-furman Jan 10, 2024
49f43f9
first stab at wrap_chat_template, various
daniel-furman Jan 10, 2024
021232b
llama test
daniel-furman Jan 11, 2024
b6c75ed
llama test
daniel-furman Jan 11, 2024
047dde8
llama test
daniel-furman Jan 11, 2024
c38b9d2
llama test
daniel-furman Jan 11, 2024
1ea8470
llama test
daniel-furman Jan 11, 2024
2e27053
llama test
daniel-furman Jan 11, 2024
43dee06
llama test
daniel-furman Jan 13, 2024
39a11d0
llama test
daniel-furman Jan 13, 2024
bbcdffb
remove system
daniel-furman Jan 13, 2024
2b40017
Merge branch 'main' into add-chat-templating
haileyschoelkopf Jan 15, 2024
c47de8b
update Instance.args setter
haileyschoelkopf Jan 15, 2024
6ca8ab1
clean up wrap_chat_template + add TODOs
haileyschoelkopf Jan 15, 2024
b8bda47
Merge branch 'main' into add-chat-templating
haileyschoelkopf Jan 15, 2024
68c30aa
push most recent code
haileyschoelkopf Jan 16, 2024
d03c9fd
add the hack (works for Mistral/Llama, destroys performance for GPT2)
haileyschoelkopf Jan 19, 2024
42d54f8
add the hack (works for Mistral/Llama, destroys performance for GPT2)
haileyschoelkopf Jan 19, 2024
787c99e
Merge branch 'fix-len0-continuations' of https://github.com/EleutherA…
haileyschoelkopf Jan 19, 2024
add the hack (works for Mistral/Llama, destroys performance for GPT2)
haileyschoelkopf committed Jan 19, 2024
commit d03c9fdeec7cce9fa95cc3048211e0f35d3b7f1f
20 changes: 20 additions & 0 deletions lm_eval/models/huggingface.py
@@ -792,6 +792,26 @@ def _encode_pair(
# context_enc = self.tok_encode(context, add_special_tokens=False)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]

# quite the hack, but what this does:
# circumvents the addition of an extraneous sentencepiece underline token
# that is produced when passing " <word>" into the Llama / Mistral tokenizer.
# if we instead pass "<word>" in, we don't get this extra token (29871 for Llama),
# which would hurt performance if included.
if (
len(continuation.lstrip()) + 1 == len(continuation)
and continuation.startswith(" ")
) or (len(continuation_enc) == 0):
context_enc_2 = context_enc
continuation_enc_2 = self.tok_encode(
continuation[1:], add_special_tokens=False
)

# assert context_enc == context_enc_2
# assert continuation_enc == continuation_enc_2, f"{continuation_enc},{continuation_enc_2}"

return context_enc_2, continuation_enc_2

return context_enc, continuation_enc

def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
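The branch taken in the hunk above hinges on one condition: the continuation starts with exactly one leading space, or it tokenized to nothing. The sketch below is a minimal, self-contained reimplementation of that logic for illustration; the function name `encode_pair_with_space_hack` and the injectable `encode` callable are hypothetical stand-ins (the real code calls `self.tok_encode`, and a real Llama/Mistral tokenizer is not assumed to be available here):

```python
from typing import Callable, List, Tuple


def encode_pair_with_space_hack(
    context: str,
    continuation: str,
    context_enc: List[int],
    continuation_enc: List[int],
    encode: Callable[[str], List[int]],
) -> Tuple[List[int], List[int]]:
    """Mirror of the PR's hack: if the continuation begins with exactly one
    space, or its encoding came out empty, re-encode it with the first
    character dropped to avoid an extraneous SPIECE_UNDERLINE token
    (id 29871 for Llama)."""
    has_single_leading_space = (
        continuation.startswith(" ")
        and len(continuation.lstrip()) + 1 == len(continuation)
    )
    if has_single_leading_space or len(continuation_enc) == 0:
        # Re-encode without the leading character, as in the original hunk.
        return context_enc, encode(continuation[1:])
    return context_enc, continuation_enc
```

Note that a continuation with two or more leading spaces does not trigger the branch (`len(continuation.lstrip()) + 1` no longer equals `len(continuation)`), so only the single-space case that produces the stray underline token is rewritten.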