[FEA] In-memory cupy arrays for TF sentence_encoder #8750
Comments
Hi @mlahir1, thanks for filing an issue. Is the behavior you'd like here for a specific TensorFlow model to accept a GPU array of strings, rather than an array of strings on the CPU?
Yes, that's right, @beckernick. It's not for a specific model: any model that takes in CPU strings should be able to have them in GPU memory, rather than going back and forth to host memory.
@mlahir1, thanks for raising the issue. I went down the rabbit hole to figure out how we can enable this. The natural place to enable it is in the tokenizer (i.e. converting text to numeric tensors), whose output can be fed directly into a TensorFlow model. Sadly, there does not seem to be a straightforward way for a user to separate tokenization from the model in TensorFlow. There are some workarounds people use, but I don't think they work for the Universal Sentence Encoder model (they only work for the Multilingual Universal Sentence Encoder model). I have raised a question about it here. Related Issues:
Possible workaround: an alternative may be to use an equivalent model in PyTorch/HuggingFace, something like https://huggingface.co/johngiorgi/declutr-base. The tokenizer used in this model might not be too hard for us to implement using cudf if we really need something like this.
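To make "separating tokenization from the model" concrete, here is a minimal sketch of the tokenization step on its own: strings go in, a padded matrix of integer ids comes out, and only that numeric matrix would need to reach the model. Everything in it is an assumption for illustration: the whitespace splitting, the `PAD_ID`/`UNK_ID` values, and the `build_vocab`/`encode` helpers are hypothetical and are not the tokenizer of any model mentioned above, which use their own subword schemes. It runs on plain Python lists, not on cudf.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical id reserved for padding
UNK_ID = 1  # hypothetical id for tokens missing from the vocabulary


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer id to each distinct lowercased whitespace token."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode(texts: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Map each string to token ids and right-pad rows to a common width."""
    rows = [[vocab.get(tok, UNK_ID) for tok in t.lower().split()] for t in texts]
    width = max((len(r) for r in rows), default=0)
    return [r + [PAD_ID] * (width - len(r)) for r in rows]


sentences = ["GPU strings stay on device", "strings on host"]
vocab = build_vocab(sentences)
token_ids = encode(sentences, vocab)  # rectangular, purely numeric
```

The point of the sketch is only the shape of the interface: once the output is a rectangular integer matrix, it no longer matters that the input was a string column.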
This issue has been labeled |
With the work going into #14926, we will soon expose a path that gives users direct views of our data as host or device Arrow arrays. Since Arrow is a standardized interchange format, that will be the right approach for this going forward. cupy arrays aren't the right choice here because cupy doesn't support strings. TF already supports loading Arrow data, so if this request arises again, the right thing to do is to make sure the dataset loader can handle Arrow device data.
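As a rough illustration of why Arrow fits where a plain numeric array does not: Arrow's variable-length string layout stores a column as an offsets buffer plus one contiguous buffer of UTF-8 bytes, so strings of different lengths become two flat numeric buffers. The sketch below mimics that layout in pure Python. The helper names are hypothetical, the validity (null) bitmap is left out, and none of this is cudf, pyarrow, or TF API.

```python
from array import array
from typing import List, Tuple


def to_offsets_and_bytes(strings: List[str]) -> Tuple[array, bytes]:
    """Pack strings into (offsets, data); offsets has len(strings) + 1 entries."""
    offsets = array("i", [0])  # signed ints, standing in for Arrow's int32 offsets
    chunks = []
    for s in strings:
        encoded = s.encode("utf-8")
        chunks.append(encoded)
        offsets.append(offsets[-1] + len(encoded))
    return offsets, b"".join(chunks)


def from_offsets_and_bytes(offsets: array, data: bytes) -> List[str]:
    """Recover the strings: element i is data[offsets[i]:offsets[i + 1]]."""
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


offsets, data = to_offsets_and_bytes(["hello", "", "GPU"])
restored = from_offsets_and_bytes(offsets, data)
```

Because both buffers are flat, they can live in host or device memory alike, which is what makes a device-side view of string data possible in the first place.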
To send data into sentence_encoder, the cudf Series must first be converted to a Series on the host, which is inefficient.
There should be a method to convert these string arrays to a cupy array, or to some other format that can be loaded directly into the TF sentence_encoder.
example: