[FEA] Inmemory cupy Arrays for TF sentencece_encoder #8750

mlahir1 · 2021-07-15T19:53:10Z

To send data into the TF sentence encoder, the cudf series needs to be converted to a series on the host, which is inefficient.
We need a method to convert these string arrays to a cupy array, or some other format that can be loaded directly into the TF sentence encoder.

example:

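import tensorflow_hub as hub

def get_universal_sentencece_encoder_model():
    # Load the Universal Sentence Encoder from TF Hub
    module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
    model = hub.load(module_url)
    return model

sentence_model = get_universal_sentencece_encoder_model()

# df is a cudf DataFrame with a string column "a";
# values_host copies the column from GPU memory to host memory
sentence_list = df.a.values_host
sentence_model(sentence_list)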


beckernick · 2021-07-19T14:48:29Z

Hi @mlahir1, thanks for filing an issue. Is the behavior you'd like here for a specific TensorFlow model to accept a GPU array of strings, rather than an array of strings on the CPU?

mlahir1 · 2021-07-19T19:16:56Z

Yes, that's right @beckernick. It's not for a specific model; it applies to any model that takes in strings on the CPU. Rather than going back and forth to host memory, we'd like to keep the data in GPU memory.
@VibhuJawa can elaborate on this.

VibhuJawa · 2021-07-19T22:08:36Z

@mlahir1, thanks for raising the issue.

I went down the rabbit hole to figure out how we can enable this. The natural place to enable this would be in the tokenizer (i.e. converting text to numeric tensors), whose output can be fed directly into a TensorFlow model.
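To make "converting text to numeric tensors" concrete, here is a minimal sketch of what a tokenizer produces. It is a toy whitespace tokenizer, not the tokenizer used by the Universal Sentence Encoder, and the vocabulary, the reserved ids and the function names `build_vocab` and `tokenize` are all hypothetical. The point is only that the output is a fixed-shape block of integers, which is the kind of data that can live in a GPU array.

```python
# Toy illustration only: a whitespace tokenizer with a hypothetical vocabulary.
# It is NOT the tokenizer used by the Universal Sentence Encoder.
from typing import Dict, List

PAD_ID = 0  # hypothetical id reserved for padding
UNK_ID = 1  # hypothetical id for out-of-vocabulary tokens


def build_vocab(sentences: List[str]) -> Dict[str, int]:
    """Assign an integer id to every distinct lowercase token, in first-seen order."""
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def tokenize(sentences: List[str], vocab: Dict[str, int], max_len: int) -> List[List[int]]:
    """Convert each sentence to a fixed-length row of token ids (truncate, then pad)."""
    rows = []
    for sentence in sentences:
        ids = [vocab.get(tok, UNK_ID) for tok in sentence.lower().split()][:max_len]
        ids += [PAD_ID] * (max_len - len(ids))
        rows.append(ids)
    return rows


sentences = ["the quick brown fox", "the lazy dog"]
vocab = build_vocab(sentences)
# Every row has exactly max_len integers, so the result is a rectangular numeric array.
token_ids = tokenize(sentences + ["an unseen fox"], vocab, max_len=4)
```

If tokenization could be run on the GPU and separated from the model, only rows like `token_ids` would need to be handed to the model, and the strings themselves would never have to be copied to the host.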

Sadly, there does not seem to be a straightforward way for a user to separate tokenization from the model in TensorFlow. There are some workarounds people use, but I don't think they work for the Universal Sentence Encoder model (they only work for the Multilingual Universal Sentence Encoder model).

I have raised a question about it here .

Related Issues :

Possible Workaround:

Another workaround may be to use an equivalent model in PyTorch/Hugging Face, such as https://huggingface.co/johngiorgi/declutr-base .

The tokenizer used by this model might not be too hard for us to implement using cudf if we really need something like this.

github-actions · 2021-11-15T21:03:08Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr · 2024-05-24T16:39:52Z

With the work going into #14926 we will soon expose a path to give users direct views of our data as host or device arrow arrays. Since arrow is a standardized interchange format, that will be the right approach for this going forward. cupy arrays aren't the right choice here because cupy doesn't support strings. TF already supports loading arrow data, so if this request arises again the right thing to do is to make sure the dataset loader can handle arrow device data.
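As a rough illustration of why a plain numeric array is a poor fit for strings while Arrow handles them, the sketch below mimics the offsets-plus-bytes layout that Arrow uses for variable-length string columns. It is a simplified pure-Python model, not the Arrow or cudf API: the function names `encode_strings` and `decode_strings` are hypothetical, and the validity (null) bitmap that real Arrow arrays carry is omitted.

```python
# Simplified model of an Arrow-style string column: one contiguous UTF-8 byte
# buffer plus an offsets buffer with len(strings) + 1 entries.
# Hypothetical helper names; real Arrow arrays also carry a validity bitmap.
from typing import List, Tuple


def encode_strings(strings: List[str]) -> Tuple[List[int], bytes]:
    """Pack strings into (offsets, data); string i occupies data[offsets[i]:offsets[i + 1]]."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))  # end of this string, start of the next
    return offsets, bytes(data)


def decode_strings(offsets: List[int], data: bytes) -> List[str]:
    """Rebuild the original strings by slicing the byte buffer at each pair of offsets."""
    return [data[start:end].decode("utf-8") for start, end in zip(offsets, offsets[1:])]


offsets, data = encode_strings(["hi", "", "héllo"])
roundtrip = decode_strings(offsets, data)
```

Because the strings have different byte lengths, they cannot be stored as one rectangular array of fixed-width elements without padding; the two flat buffers above avoid that, and both can live in either host or device memory.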

mlahir1 added the Needs Triage and feature request labels Jul 15, 2021

beckernick added the question label and removed the Needs Triage label Jul 19, 2021

beckernick added this to the Tabular Data for Deep Learning milestone Jul 23, 2021

github-actions bot added the inactive-90d label Nov 15, 2021

GregoryKimball added the Python label and removed the inactive-90d label Nov 21, 2022

vyasr closed this as completed May 24, 2024
