Memory leak for large strings #1539
Related to #1495 |
Hello! I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

```python
import random
import string
import time

import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MiB, {delta_t:.2f}s")
```

Outputs
This is rather severe: not just a massive growth in memory usage, but the tokenization speed is also much, much lower.

Notes

The memory usage is much more reasonable if the strings:
|
I will check; this might be related to FFI (Foreign Function Interface) and the way strings are passed to Rust in the background. |
+1 on facing this issue. Happy to help in any way to get this fixed! |
FWIW, it appears to leak even if |
Ah, then it might be the interface between Rust and Python |
https://dora-rs.ai/blog/rust-python/ I'll try to follow that |
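A quick way to tell which side is leaking: `tracemalloc` only sees Python-heap allocations, so if the process RSS keeps climbing while the `tracemalloc` totals stay flat, the growth is coming from native (Rust) allocations rather than from Python objects. A minimal sketch of that check, reusing the reproducer from the top of the thread:

```python
import random
import string
import tracemalloc

import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

tracemalloc.start()
for iteration in range(20):
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    python_bytes, _ = tracemalloc.get_traced_memory()
    rss_bytes = psutil.Process().memory_info().rss
    # tracemalloc only tracks the Python heap; a rising rss figure alongside
    # a flat python figure points at allocations made on the native side
    print(f"{iteration:02d}: python={python_bytes / 2**20:.2f}MiB rss={rss_bytes / 2**20:.2f}MiB")
```
|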
if anyone has a fix feel free to open a PR! |
I'm also getting a memory leak. Memory keeps growing until the program crashes. I think this should be a high-priority bug to fix. |
Will investigate. Do you have a reproducer as well? It would help in figuring out the extent of the bug. |
Using a concrete dataset:

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
filepath = hf_hub_download("roneneldan/TinyStories", "TinyStoriesV2-GPT4-train.txt", repo_type="dataset")
stories = open(filepath).read().split("\n<|endoftext|>\n")
print(len(stories))  # 2,717,700

outputs = []
chunk_size = 10_000
for i in range(0, len(stories), chunk_size):
    chunk = stories[i : min(i + chunk_size, len(stories))]
    # memory increases by 1GB every 2-3s
    outputs.append(tokenizer(chunk, return_attention_mask=False))
    # memory increases at a much slower rate, but might still be abnormal
    # outputs.append(tokenizer(chunk, return_attention_mask=False).input_ids)
```

The final data is 587,316,317 tokens, which can fit in memory (using int64, it is ~4GB). In the end I switched to SentencePiece to tokenize the data instead. |
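For reference, a minimal sketch of that SentencePiece route, reusing `stories` and `chunk_size` from the snippet above, and assuming the Llama-2 repo ships its raw SentencePiece model as `tokenizer.model`:

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# assumption: the repo exposes the raw SentencePiece model file "tokenizer.model"
model_path = hf_hub_download("meta-llama/Llama-2-7b-hf", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

outputs = []
for i in range(0, len(stories), chunk_size):
    chunk = stories[i : i + chunk_size]
    # encode() accepts a list of strings and returns a list of token-id lists
    outputs.append(sp.encode(chunk))
```
|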
The Rust tokenizer version 0.19.1 also has the same problem; is there any progress? @ArthurZucker |
I would recommend first testing with https://github.com/huggingface/tokenizers/releases (0.20.0), but I will investigate. |
I can confirm that my snippet above still has the memory leak with tokenizers==0.20.0. |
This snippet will cause memory usage to rise indefinitely:

If you set `refresh_every` to 100000 (like it is in the snippet), the memory usage will keep on rising; the Colab notebook crashes after about 15 minutes of executing. If you set `refresh_every` to 100, the memory consumption will be stable.
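The snippet itself is not preserved above, but the behavior described suggests a pattern along these lines, where `refresh_every` controls how often the `Tokenizer` object is re-created (the names and the refresh mechanism here are assumptions, not the original code):

```python
import random
import string

from tokenizers import Tokenizer

refresh_every = 100  # 100_000 reportedly keeps rising; 100 keeps memory stable

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')
for iteration in range(99999999):
    if iteration > 0 and iteration % refresh_every == 0:
        # dropping and re-creating the tokenizer releases whatever
        # memory the old instance accumulated, capping the growth
        tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')
    tokenizer.encode(random_string(12345))
```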