
TensorFlow fails while no TensorFlow expected to run at all #1532

Closed
artkpv opened this issue May 22, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments


artkpv commented May 22, 2024

Describe the bug

I am running oaieval for the steganography eval with Llama 3 70B using PyTorch and Hugging Face. As far as I know, I don't use TensorFlow anywhere. However, I can see that some TensorFlow code runs and fails.

To Reproduce

Add the code to run Llama as shown below in Code snippets. I see these messages in stdout:

gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11.1.0.1)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Python 3.11.5
[2024-05-22 11:46:58,821] [registry.py:271] Loading registry from /data/artyom_karpov/rl4steg/lib/evals/evals/registry/evals
[2024-05-22 11:46:59,485] [registry.py:271] Loading registry from /data/artyom_karpov/.evals/evals
[2024-05-22 11:46:59,704] [registry.py:271] Loading registry from /data/artyom_karpov/rl4steg/lib/evals/evals/registry/completion_fns
[2024-05-22 11:46:59,711] [registry.py:271] Loading registry from /data/artyom_karpov/.evals/completion_fns
[2024-05-22 11:46:59,711] [registry.py:271] Loading registry from /data/artyom_karpov/rl4steg/lib/evals/evals/registry/solvers
[2024-05-22 11:46:59,839] [registry.py:271] Loading registry from /data/artyom_karpov/.evals/solvers
/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
[2024-05-22 11:47:07,329] [modeling.py:989] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).

Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]
Loading checkpoint shards:   3%|▎         | 1/30 [00:03<01:32,  3.20s/it]
Loading checkpoint shards:   7%|▋         | 2/30 [00:06<01:31,  3.27s/it]
Loading checkpoint shards:  10%|█         | 3/30 [00:09<01:28,  3.29s/it]
Loading checkpoint shards:  13%|█▎        | 4/30 [00:13<01:30,  3.49s/it]
Loading checkpoint shards:  17%|█▋        | 5/30 [00:17<01:26,  3.48s/it]
Loading checkpoint shards:  20%|██        | 6/30 [00:20<01:22,  3.43s/it]
Loading checkpoint shards:  23%|██▎       | 7/30 [00:23<01:19,  3.45s/it]
Loading checkpoint shards:  27%|██▋       | 8/30 [00:28<01:22,  3.73s/it]
Loading checkpoint shards:  30%|███       | 9/30 [00:32<01:18,  3.75s/it]
Loading checkpoint shards:  33%|███▎      | 10/30 [00:35<01:13,  3.69s/it]
Loading checkpoint shards:  37%|███▋      | 11/30 [00:39<01:10,  3.71s/it]
Loading checkpoint shards:  40%|████      | 12/30 [00:43<01:10,  3.94s/it]
Loading checkpoint shards:  43%|████▎     | 13/30 [00:48<01:10,  4.12s/it]
Loading checkpoint shards:  47%|████▋     | 14/30 [00:52<01:06,  4.15s/it]
Loading checkpoint shards:  50%|█████     | 15/30 [00:57<01:03,  4.26s/it]
Loading checkpoint shards:  53%|█████▎    | 16/30 [01:01<01:01,  4.39s/it]
Loading checkpoint shards:  57%|█████▋    | 17/30 [01:06<00:57,  4.41s/it]
Loading checkpoint shards:  60%|██████    | 18/30 [01:10<00:53,  4.45s/it]
Loading checkpoint shards:  63%|██████▎   | 19/30 [01:15<00:48,  4.43s/it]
Loading checkpoint shards:  67%|██████▋   | 20/30 [01:19<00:43,  4.36s/it]
Loading checkpoint shards:  70%|███████   | 21/30 [01:23<00:38,  4.28s/it]
Loading checkpoint shards:  73%|███████▎  | 22/30 [01:27<00:33,  4.22s/it]
Loading checkpoint shards:  77%|███████▋  | 23/30 [01:31<00:29,  4.22s/it]
Loading checkpoint shards:  80%|████████  | 24/30 [01:36<00:25,  4.28s/it]
Loading checkpoint shards:  83%|████████▎ | 25/30 [01:40<00:22,  4.43s/it]
Loading checkpoint shards:  87%|████████▋ | 26/30 [01:45<00:17,  4.43s/it]
Loading checkpoint shards:  90%|█████████ | 27/30 [01:49<00:13,  4.35s/it]
Loading checkpoint shards:  93%|█████████▎| 28/30 [01:53<00:08,  4.35s/it]
Loading checkpoint shards:  97%|█████████▋| 29/30 [01:58<00:04,  4.39s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [01:59<00:00,  3.47s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [01:59<00:00,  3.99s/it]
/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-05-22 11:49:08,341] [oaieval.py:215] Run started: 240522114908HNUG55EE
2024-05-22 11:49:09.863802: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-22 11:49:11.808810: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2024-05-22 11:49:13,601] [utils.py:145] Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2024-05-22 11:49:13,602] [utils.py:148] Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
[2024-05-22 11:49:13,602] [utils.py:161] NumExpr defaulting to 8 threads.
2024-05-22 11:49:15.717343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 37944 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:0f:00.0, compute capability: 8.0
[2024-05-22 11:49:24,785] [data.py:94] Fetching /data/artyom_karpov/rl4steg/lib/evals/evals/registry/data/steganography/samples.jsonl
[2024-05-22 11:49:24,792] [eval.py:36] Evaluating 480 samples
[2024-05-22 11:49:24,810] [eval.py:144] Running in threaded mode with 1 threads!

  0%|          | 0/480 [00:00<?, ?it/s]Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


.... 


2024-05-22 11:53:01.290822: W tensorflow/compiler/mlir/tools/kernel_gen/transforms/gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.

  0%|          | 1/480 [03:36<28:49:53, 216.69s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
...

Eventually the eval does run. How can I disable TensorFlow?

Code snippets

llama.py

from evals.api import CompletionFn, CompletionResult
from evals.prompt.base import CompletionPrompt
from evals.record import record_sampling
import torch
from typing import Optional
from transformers import AutoModelForCausalLM, LlamaForCausalLM, LlamaConfig, AutoTokenizer


class LlamaCompletionResult(CompletionResult):
    def __init__(self, response) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        return [self.response.strip()]


class LlamaCompletionFn(CompletionFn):
    def __init__(self, llm: str, llm_kwargs: Optional[dict] = None, **kwargs) -> None:
        self._model = AutoModelForCausalLM.from_pretrained(
            llm,
            return_dict=True,
            load_in_8bit=llm_kwargs["load_in_8bit"],
            load_in_4bit=llm_kwargs["load_in_4bit"],
            device_map="auto",
            low_cpu_mem_usage=True,
            attn_implementation="sdpa" if llm_kwargs.get("use_fast_kernels", False) else None,
            torch_dtype=torch.bfloat16
        )
        self._model.eval()

        self._tokenizer = AutoTokenizer.from_pretrained(llm)
        self._tokenizer.pad_token = self._tokenizer.eos_token
        torch.manual_seed(llm_kwargs.get("seed", 42))
        self._gen_kwargs = llm_kwargs['gen_kwargs']

    @torch.no_grad()
    def __call__(self, prompt, **kwargs) -> CompletionResult:
        prompt = self._tokenizer.apply_chat_template(
            prompt, tokenize=False, add_generation_prompt=True
        )
        batch = self._tokenizer(prompt, padding='max_length', truncation=True, max_length=None, return_tensors="pt")
        batch = {k: v.to("cuda") for k, v in batch.items()}
        outputs = self._model.generate(
            **batch,
            **self._gen_kwargs,
        )
        # Take only response:
        outputs = outputs[0][batch['input_ids'][0].size(0):]
        response = self._tokenizer.decode(outputs, skip_special_tokens=True)
        record_sampling(prompt=prompt, sampled=response)
        return LlamaCompletionResult(response)
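
(Note for readers, not part of the original report: the warning in the log above says that `load_in_8bit`/`load_in_4bit` are deprecated in favour of a `BitsAndBytesConfig`. A minimal sketch of that replacement, assuming `bitsandbytes` is installed, could look like this.)

# Sketch only: pass quantization via BitsAndBytesConfig instead of the
# deprecated load_in_4bit/load_in_8bit keyword arguments.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # matches load_in_4bit: true in the registry entry below
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute dtype consistent with torch_dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
)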

Registry entry:


llama/3-70b:
  class: evals.completion_fns.llama:LlamaCompletionFn
  args:
    llm: meta-llama/Meta-Llama-3-70B-Instruct
    llm_kwargs:
      load_in_8bit: false
      load_in_4bit: true
      use_fast_kernels: false
      gen_kwargs:
        max_new_tokens: 200
        do_sample: true
        top_p: 1.0
        temperature: 1.0
        min_length: null
        use_cache: false
        top_k: 50
        repetition_penalty: 1.0
        length_penalty: 1



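(Note for readers, not part of the original report: a hypothetical smoke test for the completion function registered above, assuming llama.py is importable as `evals.completion_fns.llama` as the registry entry indicates; the chat-style prompt format is inferred from the `apply_chat_template` call. Loading the 70B model this way requires the same GPU resources as the eval itself.)

# Hypothetical smoke test for LlamaCompletionFn outside of oaieval.
from evals.completion_fns.llama import LlamaCompletionFn

fn = LlamaCompletionFn(
    llm="meta-llama/Meta-Llama-3-70B-Instruct",
    llm_kwargs={
        "load_in_8bit": False,
        "load_in_4bit": True,
        "use_fast_kernels": False,
        "gen_kwargs": {"max_new_tokens": 32, "do_sample": False},
    },
)
result = fn([{"role": "user", "content": "Say hello."}])
print(result.get_completions()[0])

With the registry entry in place, the eval itself would be launched with something like `oaieval llama/3-70b steganography` (the exact eval name is an assumption).
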
### OS

Linux * 3.10.0-1160.76.1.0.1.el7.x86_64 #1 SMP Wed Aug 10 17:32:14 PDT 2022 x86_64 x86_64 x86_64 GNU/Linux

### Python version

Python 3.11.5

### Library version

3.0.1
artkpv added the bug (Something isn't working) label on May 22, 2024

artkpv commented May 25, 2024

Removing the TensorFlow packages with `pip uninstall` seems to solve the issue.
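
(Note for readers, not part of the original comment: `transformers` also honours the `USE_TF` environment variable, which keeps it from importing TensorFlow even when the package is installed. Whether that would have silenced the TF-TRT/ptxas messages here depends on which dependency was actually importing TensorFlow, so treat this as an untested alternative to uninstalling.)

# Untested alternative to uninstalling TensorFlow: ask transformers to skip it.
import os
os.environ["USE_TF"] = "0"     # transformers checks this at import time
os.environ["USE_TORCH"] = "1"  # keep the PyTorch backend

import transformers  # must be imported only after the variables are set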

artkpv closed this as completed on May 25, 2024