Add embedding task #224

markstur · 2023-10-04T03:13:43Z

Adds an embedding-retrieval task that returns the embedding vector for a sentence.

Returns a list of floats for one input string (a.k.a. sentence).
Support for multiple strings was removed and is left for future batch support.
Uses sentence-transformers.
The data model allows for different float types (py, np.float32, np.float64).

Data model is based on Gabe's feedback in #39

This embedding service will be extended in separate PRs with multi-task support. The service can also support sentence-similarity and reranking.
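As a rough illustration of how sentence-similarity and reranking can build on embedding vectors, here is a minimal sketch. It assumes cosine similarity as the scoring function, which this PR does not state; `cosine_similarity` and `rerank` are hypothetical helper names, not part of caikit-nlp.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rerank(
    query: Sequence[float], candidates: List[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings
ranking = rerank([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
best_index = ranking[0][0]
```

In this toy case the second candidate (index 1) ranks first because it points in the same direction as the query.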

markstur · 2023-10-04T03:21:13Z

Tagging @gabe-l-hart; will tag Anjali if I can find her.

markstur · 2023-10-19T06:01:27Z

rebased to catch up to main (and so I can stack PRs with multi-task support)

gkumbhat · 2023-10-23T16:44:56Z

caikit_nlp/data_model/embedding_vectors.py

+
+@dataobject(package="caikit_data_model.caikit_nlp")
+@dataclass
+class Vector1D(DataObjectBase):


nit: we usually try to name the output object so that it conveys the task-related output. Since this is directly the output of EmbeddingTask, can we rename this to EmbeddingVector?

We may also want to use Vector1D later on, so maybe it would make sense to go with EmbeddingResponse with Vector1D used in it.

Changed to EmbeddingResponse with Vector1D in it.

My opinion? I tend to dislike these extra levels like resp.result.data.values[2] when resp[2] would've been much nicer considering we really just want a List[float] (more or less). But I guess it looks pretty good this new way.
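To make the trade-off in this thread concrete, here is a minimal sketch of the nested shape versus a flat list. The dataclasses are hypothetical stand-ins for the caikit data objects, not the real implementation; the thread calls the wrapper EmbeddingResponse, while the later diff uses EmbeddingResult, which is the name used here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PyFloatSequence:
    """Hypothetical stand-in for the float-sequence data object."""

    values: List[float]


@dataclass
class Vector1D:
    """Hypothetical stand-in: one vector wrapping a typed float sequence."""

    data: PyFloatSequence


@dataclass
class EmbeddingResult:
    """Hypothetical stand-in for the task-level response object."""

    result: Vector1D


resp = EmbeddingResult(result=Vector1D(data=PyFloatSequence(values=[0.1, 0.2, 0.3])))

# Nested access, as described in the comment above
third = resp.result.data.values[2]

# The flat alternative would be a plain List[float]
flat = [0.1, 0.2, 0.3]
third_flat = flat[2]
```

The extra levels cost some verbosity at the call site, but they leave room to reuse Vector1D in other responses and to vary the float type without changing the task output type.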

gkumbhat · 2023-10-23T16:45:59Z

caikit_nlp/data_model/embedding_vectors.py

+ )
+
+ @classmethod
+ def from_embeddings(cls, embeddings):


nit: Can we rename this function to from_vector?

gkumbhat · 2023-10-23T16:48:55Z

caikit_nlp/modules/embedding_retrieval/embedding.py

+ f"model_path '{model_config_path}' is invalid",
+ )
+
+ model_config_path = os.path.abspath(


why do we need absolute path here?

An example of how we use ModuleSaver: https://github.com/caikit/caikit-nlp/blob/main/caikit_nlp/modules/text_generation/text_generation_local.py#L482

The ModuleSaver docstring says it takes an "absolute path":

> model_path (str): The absolute path to the directory where the model will be saved. If this directory does not exist, it will be created.

It doesn't look like that is necessary today, but better to follow the doc.

re: example using ModuleSaver... I'd recommend you do not use the ModuleSaver context manager. It isn't safe. caikit/caikit#525

gkumbhat · 2023-10-23T16:51:39Z

caikit_nlp/modules/embedding_retrieval/embedding.py

+ saver.update_config({self._ARTIFACTS_PATH_KEY: artifacts_path})
+
+ # Save the model
+ artifacts_path = os.path.abspath(


Is an absolute path required here?

Nope. Removed.

evaline-ju

few small things

evaline-ju · 2023-10-23T18:12:56Z

caikit_nlp/modules/tokenization/__init__.py

@@ -12,5 +12,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+# First Party
+from caikit.core import TaskBase, task


there's no task added to this file - was this added for another particular reason?

evaline-ju · 2023-10-23T18:14:46Z

tests/data_model/test_embedding_vectors.py

+
+def test_vector1d_trick():
+ """FYI -- The param check currently allows for objects with values using this trick"""
+ dm.Vector1D(data=TRICK_SEQUENCE(values=[1.1, 2.2]))


is there anything we can assert on the resulting object to make sure that this is not just not-erroring?

I removed this test. It was redundant with below where the results are validated. Less confusing to just get rid of this one.

evaline-ju · 2023-10-23T19:43:56Z

caikit_nlp/modules/embedding_retrieval/embedding.py

+from .embedding_retrieval_task import EmbeddingRetrievalTask
+from caikit_nlp.data_model.embedding_vectors import Vector1D
+
+logger = alog.use_channel("<EMBD_BLK>")


nit: probably don't need the <> or there might be multiple bracket layers https://github.com/IBM/alchemy-logging/blob/main/src/python/README.md#log-contexts

evaline-ju · 2023-10-23T19:49:07Z

tests/modules/embedding_retrieval/test_embedding_retrieval.py

@@ -0,0 +1,96 @@
+"""Tests for sequence classification module


embedding or embedding retrieval module?

also nit: should we rename the file to test_embedding to match the module file?

Fixed comment. Renamed test file to match. Did some other renaming for more consistency.

markstur · 2023-11-01T20:53:14Z

@gkumbhat anything I need to do to unblock this one? Not sure if it helps if I click the "resolve conversation" buttons. Up to you.

gkumbhat · 2023-11-09T20:13:21Z

caikit_nlp/modules/text_embedding/embedding.py

+
+
+@module(
+ "EEB12558-B4FA-4F34-A9FD-3F5890E9CD3F",


Interesting: I noticed that the ID here is all caps (plus numbers). Other modules use lowercase. There isn't a rule enforcing it, but it might be good to be consistent. 🤔

done (in-coming)
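For reference, one way to get the lowercase form of the ID is Python's standard `uuid` module, whose string form is always lowercase and hyphenated. This is only a sketch of the normalization, not how the ID was changed in the PR.

```python
import uuid

# The module ID as it appears in the diff above
module_id = "EEB12558-B4FA-4F34-A9FD-3F5890E9CD3F"

# uuid.UUID validates the string; str() renders it in canonical lowercase form
normalized = str(uuid.UUID(module_id))
print(normalized)  # eeb12558-b4fa-4f34-a9fd-3f5890e9cd3f
```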

gkumbhat

Left some small things, but other than that it looks good. Thanks @markstur

gkumbhat · 2023-11-09T20:15:13Z

caikit_nlp/modules/text_embedding/embedding.py

+ """
+ error.type_check("<NLP27491611E>", str, input=input)
+
+ return EmbeddingResult(Vector1D.from_vector(self.model.encode(input)))


do we need to do some validation on model.encode output before converting it to embedding via vector or is it guaranteed to be a compatible vector always?

Always returns a tensor that works with Vector1D.from_vector(). There is nothing to validate here. If the dtype is something we didn't expect we use PyFloatSequence(). I could add a post_init so PyFloatSequence() evaluates each value, but I don't have a real case for that right now.
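The dtype-based fallback described in this reply could look roughly like the sketch below. It is an assumption-laden illustration: the classes are hypothetical stand-ins, the `NpFloat32Sequence` and `NpFloat64Sequence` names are guesses based on the float types listed in the PR description, and the dtype check assumes a numpy-like object whose `dtype` prints as `float32` or `float64`.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class PyFloatSequence:
    values: List[Any]


@dataclass
class NpFloat32Sequence:  # hypothetical name
    values: List[Any]


@dataclass
class NpFloat64Sequence:  # hypothetical name
    values: List[Any]


@dataclass
class Vector1D:
    data: Any

    @classmethod
    def from_vector(cls, vector):
        """Pick a float-sequence wrapper based on the vector's dtype, if any."""
        dtype_name = str(getattr(vector, "dtype", ""))
        if dtype_name == "float32":
            return cls(NpFloat32Sequence(list(vector)))
        if dtype_name == "float64":
            return cls(NpFloat64Sequence(list(vector)))
        # Unknown or missing dtype: fall back to the Python float sequence.
        # Values are not validated one by one, matching the reply above.
        return cls(PyFloatSequence(list(vector)))


class FakeFloat32Array(list):
    """Tiny numpy-like stand-in used only to exercise the float32 branch."""

    dtype = "float32"


typed = Vector1D.from_vector(FakeFloat32Array([1.0, 2.0]))
fallback = Vector1D.from_vector([1.0, 2.0])
```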

gkumbhat · 2023-11-09T20:18:04Z

caikit_nlp/modules/text_embedding/embedding.py

+ )
+
+ # Get and update config (artifacts_path)
+ artifacts_path = saver.config.get(self._ARTIFACTS_PATH_KEY)


This will always be None, right? Unless we are overriding the model?

right, I can remove this

gkumbhat · 2023-11-09T20:18:32Z

caikit_nlp/modules/text_embedding/embedding.py

+
+ # Save the model
+ self.model.save(
+ os.path.join(model_config_path, artifacts_path), create_model_card=True


How does the model card get saved and consumed?

I'll remove the option, but it's a model card that is typically stored with the model files, documenting what is there.

gkumbhat · 2023-11-09T20:20:40Z

caikit_nlp/modules/text_embedding/embedding.py

+ model_config_path.strip()
+ ) # No leading/trailing spaces sneaky weirdness
+
+ os.makedirs(model_config_path, exist_ok=False)


We can also use saver.add_dirs functions from here: https://github.com/caikit/caikit/blob/a16f063f6155f0088eb9959a32b7f0871e89731d/caikit/core/modules/saver.py#L117

This will avoid the need to also make the absolute path etc. above.

We can also use the context-based use of ModuleSaver, and that can take care of adding / making the directory as well. Example: https://github.com/caikit/caikit-nlp/blob/main/caikit_nlp/modules/text_generation/peft_prompt_tuning.py#L472

> We can also use saver.add_dirs functions from here: https://github.com/caikit/caikit/blob/a16f063f6155f0088eb9959a32b7f0871e89731d/caikit/core/modules/saver.py#L117
>
> This will avoid the need to also make the absolute path etc. above.

os.makedirs() is used because in-place updates are not safe to run on existing dirs. I use exist_ok=False to enforce that, whereas ModuleSaver would not and can remove existing file trees on exceptions. Not cool.

The abspath() is per the ModuleSaver param docstrings. strip() should not be necessary but seems better than testing what blanks would do for user experience.

Note: the net of these things (and below) is that using saver.add_dir() would only add a line of code and not save anything. Otherwise add_dir() would be fine.

> We can also use the context-based use of ModuleSaver, and that can take care of adding / making the directory as well. Example: https://github.com/caikit/caikit-nlp/blob/main/caikit_nlp/modules/text_generation/peft_prompt_tuning.py#L472

The context manager for ModuleSaver is pretty dangerous here. Use save() wrong and ModuleSaver will wipe out a directory that might be important.

But that is related to os.makedirs only, right? So that can be fixed by adding this as an option in the context manager or in the caikit repo itself? Maybe club it with your DM PR in caikit?
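The safety argument in this thread can be sketched with the standard library alone. The helper below is hypothetical (`prepare_save_dir` is not a name from the PR); it only shows the combination of strip(), abspath(), and os.makedirs(exist_ok=False) that the author describes, so that save() refuses to touch a directory that already exists.

```python
import os
import tempfile


def prepare_save_dir(model_config_path: str) -> str:
    """Validate the path and create a brand-new directory for the model."""
    if not isinstance(model_config_path, str) or not model_config_path.strip():
        raise ValueError(f"model_path '{model_config_path}' is invalid")

    # Normalize: no leading/trailing blanks, absolute per the saver docstring
    path = os.path.abspath(model_config_path.strip())

    # exist_ok=False raises FileExistsError instead of reusing an existing
    # directory, so a failed save can never clean up someone else's files
    os.makedirs(path, exist_ok=False)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "model")
    created = prepare_save_dir(target)
    created_ok = os.path.isdir(created)
    try:
        prepare_save_dir(target)  # second call hits the existing directory
        refused_existing = False
    except FileExistsError:
        refused_existing = True
```

Failing early on an existing directory trades a little convenience for the guarantee that error handling later in save() only ever removes files this call created.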

markstur · 2023-11-14T00:26:36Z

@gkumbhat Updated based on feedback. I don't want to use some of the saver() code until it is fixed to be safe and even revert partial changes, but we did get some unnecessary stuff out of save(). Thanks!

gkumbhat

LGTM

markstur requested review from alex-jw-brooks, gkumbhat, evaline-ju, gabe-l-hart and tharapalanivel as code owners October 4, 2023 03:13

gkumbhat requested changes Oct 6, 2023

markstur requested a review from gkumbhat October 14, 2023 04:51

gkumbhat requested changes Oct 16, 2023

gkumbhat mentioned this pull request Oct 16, 2023

Add rerank and sentence-similarity tasks to text embedding module #235

Merged

markstur requested a review from gkumbhat October 17, 2023 22:12

gkumbhat and others added 11 commits October 18, 2023 22:19

✨ Add embedding related data models

9fecb56

Signed-off-by: gkumbhat <[email protected]>

✨ Add embedding retrieval task

2bde952

Signed-off-by: gkumbhat <[email protected]>

✨ Add numpy dtype support for storing floatvalues

e412bb1

Signed-off-by: gkumbhat <[email protected]>

✅ Add test for 2D embedding vector and fix ndim bug

258a986

Signed-off-by: gkumbhat <[email protected]>

Embedding retrieval

dcbdb0e

Return a list of vectors. One for each input sentence. Uses sentence-transformers. The data model allows for different float types (py, np.float32, np.float64). Signed-off-by: markstur <[email protected]> Co-authored-by: gkumbhat <[email protected]>

fmt fixin

a2605d1

Signed-off-by: markstur <[email protected]>

float approx() for test equality

f30752e

Need more approx() wrappers to pass CI. Signed-off-by: markstur <[email protected]>

Embeddings data_model checks and tests

9431dbc

* More type checks * Ensure to JSON uses consistent keys not varying one-of names * More tests Signed-off-by: markstur <[email protected]>

Add safety checks for save()

83b5902

* Was not safe for use with existing dir or empty path because errors lead to rmtree. * Added checks and tests * Some additional cleanup Signed-off-by: markstur <[email protected]>

Single sentence embedding retrieval

b30028c

* One sentence in, one vector out * Use bootstrap/save to create a model config with model artifacts * Simplified: * Removed the support for sentences/vectors * Removed the hf_model download Signed-off-by: markstur <[email protected]>

Tweaks from review feedback

af35dd9

Signed-off-by: markstur <[email protected]>

markstur force-pushed the add_embedding_task branch from 0f760f5 to af35dd9 Compare October 19, 2023 05:57

gkumbhat reviewed Oct 23, 2023

evaline-ju reviewed Oct 23, 2023

Renames and cleanup

2d61326

* Renames for more consistent naming * Adding a level EmbeddingResult -> Vector1D for consistent naming and future use * Remove an abspath() call that isn't needed for model save() * Remove a redundant test (was confusing) Signed-off-by: markstur <[email protected]>

markstur force-pushed the add_embedding_task branch from 548fd97 to 2d61326 Compare October 24, 2023 08:03

markstur requested review from evaline-ju and gkumbhat October 24, 2023 15:48

gkumbhat reviewed Nov 9, 2023

Per-review feedback: save() code is cleaner and GUIDS are guids.

2b8b134

* Removed some unnecessary things in save() * Lowercase module GUID for consistency Signed-off-by: markstur <[email protected]>

markstur requested a review from gkumbhat November 17, 2023 17:50

gkumbhat approved these changes Nov 17, 2023

gkumbhat merged commit 316ead6 into caikit:main Nov 20, 2023
5 checks passed


