When I use the tokenizer, I obtain many patterns that span across document boundaries, which is quite strange. #39

Open
gawei1995 opened this issue Mar 8, 2024 · 4 comments

Comments

@gawei1995

gawei1995 commented Mar 8, 2024

For example, one pattern's deletion offset is [7511, 8038], but the document's start and end are [6604, 7516].
Cases like this account for 0.9% of the total.

| debug | del offset | doc start-end |
| --- | --- | --- |
| error | [7511, 8038] | 6604 7516 |
| error | [221144, 221346] | 217374 221220 |
| error | [239409, 240382] | 236682 239466 |
| error | [262775, 263050] | 254246 262838 |
| error | [268452, 270316] | 262838 268722 |
| error | [288764, 289772] | 286864 288954 |

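
The check behind the table above can be sketched roughly as follows. This is only an illustration: `crosses_boundary` is a hypothetical helper name, and the rows are copied from the table. It flags a deletion range that starts inside a document but ends past that document's end.

```python
# Rows copied from the debug table: ((del_start, del_end), (doc_start, doc_end)).
rows = [
    ((7511, 8038), (6604, 7516)),
    ((221144, 221346), (217374, 221220)),
    ((239409, 240382), (236682, 239466)),
    ((262775, 263050), (254246, 262838)),
    ((268452, 270316), (262838, 268722)),
    ((288764, 289772), (286864, 288954)),
]


def crosses_boundary(del_range, doc_range):
    """Hypothetical helper: True if the deletion range starts inside the
    document but extends beyond the document's end offset."""
    del_start, del_end = del_range
    doc_start, doc_end = doc_range
    return doc_start <= del_start < doc_end < del_end


flagged = [row for row in rows if crosses_boundary(*row)]

# How much of each flagged range still lies inside its document.
inside = [doc_end - del_start for (del_start, _), (_, doc_end) in flagged]
print(len(flagged), inside)
```

In every listed row, only a short stretch of the range lies inside the document and the rest runs past its end.
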
@carlini
Collaborator

carlini commented Mar 16, 2024

This deduplicator doesn't know anything about documents. It just knows strings. Do you have a document separator that you use that's not present in any of the documents? (e.g., if you have a tokenizer with <65k tokens, you can use \xff\xff\xff\xff as a separator.)
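
A minimal sketch of the separator idea, assuming each token id is stored as a 2-byte little-endian integer. That layout is an assumption for illustration, not something stated in this thread, and `encode_doc` and `build_stream` are hypothetical names. With fewer than 65k tokens, the value 0xffff is never a real token id, so `\xff\xff\xff\xff` cannot occur inside the encoded documents.

```python
import struct

# Separator suggested above; relies on 0xffff never being a valid token id.
SEPARATOR = b"\xff\xff\xff\xff"


def encode_doc(token_ids):
    """Pack token ids as unsigned 16-bit little-endian integers (assumed layout)."""
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def build_stream(docs):
    """Concatenate encoded documents, appending the separator after each one.

    Returns the byte stream and the (start, end) byte offsets of each document,
    excluding separators.
    """
    out = bytearray()
    boundaries = []
    for token_ids in docs:
        start = len(out)
        out += encode_doc(token_ids)
        boundaries.append((start, len(out)))
        out += SEPARATOR
    return bytes(out), boundaries


# Made-up token id lists standing in for tokenized documents.
docs = [[15496, 995], [40, 588, 50256]]
stream, boundaries = build_stream(docs)
print(boundaries)
```

Recording the boundaries while building the stream makes it possible to check afterwards whether a reported duplicate range stays within a single document.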

@gawei1995
Author

> This deduplicator doesn't know anything about documents. It just knows strings. Do you have a document separator that you use that's not present in any of the documents? (e.g., if you have a tokenizer with <65k tokens you can use \xff\xff\xff\xff as a separator.

I use \xff\xff as the separator. The tokenizer is GPT-2, with <51k tokens. Is there a big difference between "\xff\xff" and "\xff\xff\xff\xff"? Thanks for the reply.

@carlini
Collaborator

carlini commented Mar 18, 2024

Huh. If you can be sure that 0xff00 isn't a valid token, then \xff\xff should work, because you should be able to get away with 2 bytes. Do you put a unique counter between documents as well?

Otherwise it could match [final bit of document 1][document separator][beginning of document 2] to a document 3/4 if those were in the same positions.
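
A rough sketch of why a unique counter helps, using made-up byte strings in place of tokenized documents. The function names and the 4-byte little-endian counter format are assumptions for illustration. With a fixed separator, the bytes around two different document boundaries can be identical and so count as a duplicate; with a per-boundary counter they differ.

```python
import struct


def join_with_fixed_separator(docs, sep=b"\xff\xff"):
    """Join documents with the same separator after every document."""
    return b"".join(doc + sep for doc in docs)


def join_with_counter(docs, prefix=b"\xff\xff"):
    """Join documents with the separator followed by a unique 4-byte counter."""
    return b"".join(
        doc + prefix + struct.pack("<I", index) for index, doc in enumerate(docs)
    )


# Documents 0/1 and 2/3 meet at boundaries with identical surrounding bytes.
docs = [b"aaaaTAIL", b"HEADbbbb", b"ccccTAIL", b"HEADdddd"]

fixed = join_with_fixed_separator(docs)
counted = join_with_counter(docs)

# A pattern spanning [end of one doc][separator][start of the next doc].
cross_pattern = b"TAIL\xff\xffHEAD"
print(fixed.count(cross_pattern), counted.count(cross_pattern))
```

With the counter in place, a repeated substring can still cover the tail of one document or the head of the next, but it can no longer span the whole boundary, because the counter bytes differ at each boundary.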

@gawei1995
Author

> Huh. If you can be sure that 0xff00 isn't a valid token then \xff\xff should work because you should never be able get away with 2. Do you put a unique counter between documents as well?
>
> Otherwise it could match [final bit of document 1][document separator][beginning of document 2] to a document 3/4 if those were in the same positions.

Maybe. I'll try it, thanks.
