[FEA] Inmemory cupy Arrays for TF sentencece_encoder #8750

Closed
mlahir1 opened this issue Jul 15, 2021 · 5 comments
Labels
feature request New feature or request Python Affects Python cuDF API. question Further information is requested

Comments

@mlahir1

mlahir1 commented Jul 15, 2021

To send data into the TF sentence encoder, the cuDF Series first has to be converted to a Series on the host, which is inefficient.
There needs to be a method to convert these string arrays to a CuPy array, or some other format, that can be loaded directly into the TF sentence encoder.

example:

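import tensorflow_hub as hub

def get_universal_sentencece_encoder_model():
    # Load the Universal Sentence Encoder from TF Hub
    module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
    model = hub.load(module_url)
    return model

# df is a cuDF DataFrame with a string column "a";
# values_host copies the strings to host memory before the model sees them
sentence_list = df.a.values_host
sentence_model = get_universal_sentencece_encoder_model()
sentence_model(sentence_list)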
@mlahir1 mlahir1 added Needs Triage Need team to review and classify feature request New feature or request labels Jul 15, 2021
@beckernick
Member

Hi @mlahir1, thanks for filing an issue. Is the behavior you'd like here for a specific TensorFlow model to accept a GPU array of strings, rather than an array of strings on the CPU?

@beckernick beckernick added question Further information is requested and removed Needs Triage Need team to review and classify labels Jul 19, 2021
@mlahir1
Author

mlahir1 commented Jul 19, 2021

Yes, that's right @beckernick. It's not for a specific model; it applies to any model that takes in CPU arrays. Rather than going back and forth to host memory, the data should stay in GPU memory.
@VibhuJawa can elaborate on this.

@VibhuJawa
Member

VibhuJawa commented Jul 19, 2021

@mlahir1, thanks for raising the issue.

I went down the rabbit hole to figure out how we can enable this. The natural place to enable it is in the tokenizer (i.e., the step that converts text to numeric tensors), whose output can be fed directly into a TensorFlow model.

Sadly, there does not seem to be a straightforward way for a user to separate tokenization from the model with TensorFlow. There are some workarounds people use, but I don't think these work for the Universal Sentence Encoder model. (They only work for the Multilingual Universal Sentence Encoder model.)
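
To make the idea of a separate tokenization step concrete, here is a minimal sketch of what "converting text to numeric tensors" means. It is a toy whitespace tokenizer, not the tokenizer used by the Universal Sentence Encoder, and every name in it (`build_vocab`, `tokenize`, `PAD_ID`, `UNK_ID`) is hypothetical and chosen only for illustration.

```python
from typing import Dict, List

PAD_ID = 0  # assumed id used to pad short sentences
UNK_ID = 1  # assumed id for words missing from the vocabulary


def build_vocab(sentences: List[str]) -> Dict[str, int]:
    """Assign an integer id to every distinct lowercase word."""
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def tokenize(sentences: List[str], vocab: Dict[str, int], max_len: int) -> List[List[int]]:
    """Turn sentences into a rectangular batch of token ids."""
    batch = []
    for sentence in sentences:
        ids = [vocab.get(w, UNK_ID) for w in sentence.lower().split()][:max_len]
        ids += [PAD_ID] * (max_len - len(ids))  # pad to a fixed width
        batch.append(ids)
    return batch


sentences = ["the quick brown fox", "the lazy dog"]
vocab = build_vocab(sentences)
token_ids = tokenize(sentences, vocab, max_len=5)
```

The result is a fixed-width integer matrix, which is the kind of data that a numeric GPU array can hold, unlike the raw strings themselves.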

I have raised a question about it here.

Related issues:

  1. Extracting the tokenizer from Multilingual Universal Sentence Encoder tensorflow/hub#662
  2. Getting the word-piece tokenized text for USE tensorflow/hub#686

Possible workaround:

Another workaround may be to use an equivalent model in PyTorch/Hugging Face, such as https://huggingface.co/johngiorgi/declutr-base .

The tokenizer used in this model might not be too hard for us to implement using cudf if we really need something like this.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball GregoryKimball added Python Affects Python cuDF API. and removed inactive-90d labels Nov 21, 2022
@vyasr
Contributor

vyasr commented May 24, 2024

With the work going into #14926, we will soon expose a path to give users direct views of our data as host or device Arrow arrays. Since Arrow is a standardized interchange format, that will be the right approach for this going forward. CuPy arrays aren't the right choice here because CuPy doesn't support strings. TF already supports loading Arrow data, so if this request arises again, the right thing to do is to make sure the dataset loader can handle Arrow device data.
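
As a rough illustration of why strings fit Arrow better than a fixed-dtype numeric array, here is a simplified sketch of Arrow's variable-size string layout: one buffer of int32 offsets plus one contiguous buffer of UTF-8 bytes. It uses only the Python standard library, is not the cuDF or pyarrow API, omits the validity (null) bitmap, and the helper names are hypothetical.

```python
import struct
from typing import List, Tuple


def to_arrow_string_layout(strings: List[str]) -> Tuple[bytes, bytes]:
    """Pack strings into (offsets, data) buffers.

    offsets holds len(strings) + 1 little-endian int32 values;
    string i occupies data[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    data = bytearray()
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buf, bytes(data)


def from_arrow_string_layout(offsets_buf: bytes, data_buf: bytes) -> List[str]:
    """Recover the original strings from the two buffers."""
    count = len(offsets_buf) // 4  # each int32 offset is 4 bytes
    offsets = struct.unpack(f"<{count}i", offsets_buf)
    return [
        data_buf[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(count - 1)
    ]


offsets_buf, data_buf = to_arrow_string_layout(["hello", "gpu", ""])
round_trip = from_arrow_string_layout(offsets_buf, data_buf)
```

Because both buffers are plain contiguous memory, a consumer that understands this layout can read the strings wherever the buffers live, without first rebuilding per-string Python objects on the host.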

@vyasr vyasr closed this as completed May 24, 2024