Memory leak for large strings #1539
Related to #1495 |
Hello! I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

```python
import random
import string
import time

import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MiB, {delta_t:.2f}s")
```

Outputs
This is rather severe: not just a massive growth in memory usage, but the tokenization speed is also much, much lower.

Notes

The memory usage is much more reasonable if the strings:
|
I will check; this might be related to FFI (Foreign Function Interface) and the way strings are passed to Rust in the background. |
+1 on facing this issue. Happy to help in any way to get this fixed! |
FWIW, it appears to leak even if |
Ah, then it might be the interface between Rust and Python |
https://dora-rs.ai/blog/rust-python/ I'll try to follow that |
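A quick way to tell which side is leaking: `tracemalloc` only sees Python-heap allocations, so if the process RSS keeps climbing while the `tracemalloc` totals stay flat, the growth is coming from native (Rust) allocations rather than from Python objects. A minimal sketch of that check, reusing the reproducer from the top of the thread:

```python
import random
import string
import tracemalloc

import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

tracemalloc.start()
for iteration in range(20):
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    python_bytes, _ = tracemalloc.get_traced_memory()
    rss_bytes = psutil.Process().memory_info().rss
    # tracemalloc only tracks the Python heap; a rising rss figure alongside
    # a flat python figure points at allocations made on the native side
    print(f"{iteration:02d}: python={python_bytes / 2**20:.2f}MiB rss={rss_bytes / 2**20:.2f}MiB")
```
|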
if anyone has a fix feel free to open a PR! |
I'm also getting a memory leak. Memory keeps growing until the program crashes. I think this should be a high-priority bug to fix. |
Will investigate. Do you have a reproducer as well? It would help in figuring out the extent of the bug. |
Using a concrete dataset:

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
filepath = hf_hub_download("roneneldan/TinyStories", "TinyStoriesV2-GPT4-train.txt", repo_type="dataset")
stories = open(filepath).read().split("\n<|endoftext|>\n")
print(len(stories))  # 2,717,700

outputs = []
chunk_size = 10_000
for i in range(0, len(stories), chunk_size):
    chunk = stories[i : min(i + chunk_size, len(stories))]
    # memory increases by 1GB every 2-3s
    outputs.append(tokenizer(chunk, return_attention_mask=False))
    # memory increases at a much slower rate, but might still be abnormal
    # outputs.append(tokenizer(chunk, return_attention_mask=False).input_ids)
```

The final data is 587,316,317 tokens, which can fit in memory (using int64, it is ~4GB). In the end I switched to SentencePiece to tokenize the data instead. |
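For reference, a minimal sketch of that SentencePiece route, reusing `stories` and `chunk_size` from the snippet above, and assuming the Llama-2 repo ships its raw SentencePiece model as `tokenizer.model`:

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# assumption: the repo exposes the raw SentencePiece model file "tokenizer.model"
model_path = hf_hub_download("meta-llama/Llama-2-7b-hf", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

outputs = []
for i in range(0, len(stories), chunk_size):
    chunk = stories[i : i + chunk_size]
    # encode() accepts a list of strings and returns a list of token-id lists
    outputs.append(sp.encode(chunk))
```
|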
The Rust tokenizer version 0.19.1 also has the same problem; is there any progress? @ArthurZucker |
I would recommend first testing with https://github.com/huggingface/tokenizers/releases (0.20.0), but I will investigate. |
I can confirm that my snippet above still has the memory leak with tokenizers==0.20.0. |
This snippet will cause memory usage to rise indefinitely:

If you set `refresh_every` to 100000 (like it is in the snippet), the memory usage will keep on rising; the Colab notebook crashes after about 15 minutes of executing. If you set `refresh_every` to 100, the memory consumption will be stable.
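The snippet itself is not preserved above, but the behavior described suggests a pattern along these lines, where `refresh_every` controls how often the `Tokenizer` object is re-created (the names and the refresh mechanism here are assumptions, not the original code):

```python
import random
import string

from tokenizers import Tokenizer

refresh_every = 100  # 100_000 reportedly keeps rising; 100 keeps memory stable

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')
for iteration in range(99999999):
    if iteration > 0 and iteration % refresh_every == 0:
        # dropping and re-creating the tokenizer releases whatever
        # memory the old instance accumulated, capping the growth
        tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')
    tokenizer.encode(random_string(12345))
```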