
[Question]: Getting ValueError Invalid literal for base 10 while vectorizing #14535

Open
1 task done
SiddharthMurjani opened this issue Jul 3, 2024 · 3 comments
Labels
question Further information is requested stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed

Comments

@SiddharthMurjani

SiddharthMurjani commented Jul 3, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I am vectorizing using the code below:

embed_model = llm_class(api_key=os.getenv("OPENAI_API_KEY"), deployment=model_type, model=model_type, **kwargs)
Settings.embed_model = embed_model

def save_llama(content, metadata, storage_context):
    document = Document(text=content, metadata=metadata)
    VectorStoreIndex.from_documents(
        documents=[document],
        storage_context=storage_context,
        show_progress=True
    )
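For context, the exception in the title is Python's built-in int() parse error, raised whenever int() is handed a string that is not a plain base-10 integer (a float rendered as text, for example). A quick sketch:

```python
# Reproducing the exception class named in the issue title:
# int() raises ValueError when the string is not a plain base-10 integer.
print(int("42"))          # a clean integer string parses fine
try:
    int("42.0")           # a float-like string does not
except ValueError as err:
    print(err)            # invalid literal for int() with base 10: '42.0'
```

So somewhere in the vectorizing path, a non-integer string is reaching an int() call.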
@SiddharthMurjani SiddharthMurjani added the question Further information is requested label Jul 3, 2024

dosubot bot commented Jul 3, 2024

To resolve the ValueError: Invalid literal for base 10 error while vectorizing, ensure that the vector store is properly initialized and passed to the query. The error message indicates that a vector store is required for the vector store query.

Here is the corrected code snippet:

embed_model = llm_class(api_key=os.getenv("OPENAI_API_KEY"), deployment=model_type, model=model_type, **kwargs)
Settings.embed_model = embed_model

def save_llama(content, metadata, storage_context):
    document = Document(text=content, metadata=metadata)
    
    # Ensure the vector store is properly loaded or initialized
    vector_store = SomeVectorStore.load_from_disk('vector_store.json')  # Replace with actual vector store loading method
    
    VectorStoreIndex.from_documents(
        documents=[document],
        storage_context=storage_context,
        vector_store=vector_store,  # Pass the vector store to the query
        show_progress=True
    )

Make sure to replace SomeVectorStore.load_from_disk('vector_store.json') with the actual method to load or initialize your vector store. This should resolve the ValueError by ensuring that the vector store is provided for the vector store query [1].

To continue talking to Dosu, mention @dosu.

@SiddharthMurjani
Author

After debugging, I found that no index_struct is being returned.

        with self._callback_manager.as_trace("index_construction"):
            if index_struct is None:
                nodes = nodes or []
                index_struct = self.build_index_from_nodes(
                    nodes + objects  # type: ignore
                )

And that's why the VectorStoreIndex is not being initialized:

class VectorStoreIndex(BaseIndex[IndexDict]):
    """Vector Store Index.

    Args:
        use_async (bool): Whether to use asynchronous calls. Defaults to False.
        show_progress (bool): Whether to show tqdm progress bars. Defaults to False.
        store_nodes_override (bool): set to True to always store Node objects in index
            store and document store even if vector store keeps text. Defaults to False
    """

    index_struct_cls = IndexDict

    def __init__(
        self,
        nodes: Optional[Sequence[BaseNode]] = None,
        # vector store index params
        use_async: bool = False,
        store_nodes_override: bool = False,
        embed_model: Optional[EmbedType] = None,
        insert_batch_size: int = 2048,
        # parent class params
        objects: Optional[Sequence[IndexNode]] = None,
        index_struct: Optional[IndexDict] = None,
        storage_context: Optional[StorageContext] = None,
        callback_manager: Optional[CallbackManager] = None,
        transformations: Optional[List[TransformComponent]] = None,
        show_progress: bool = False,
        # deprecated
        service_context: Optional[ServiceContext] = None,
        **kwargs: Any,
    ) -> None:
        """Initialize params."""
        self._use_async = use_async
        self._store_nodes_override = store_nodes_override
        self._embed_model = (
            resolve_embed_model(embed_model, callback_manager=callback_manager)
            if embed_model
            else embed_model_from_settings_or_context(Settings, service_context)
        )

        self._insert_batch_size = insert_batch_size
        super().__init__(
            nodes=nodes,
            index_struct=index_struct,
            service_context=service_context,
            storage_context=storage_context,
            show_progress=show_progress,
            objects=objects,
            callback_manager=callback_manager,
            transformations=transformations,
            **kwargs,
        )

@logan-markewich
Collaborator

I don't think this is related to anything with VectorStoreIndex -- I think your embedding model is not returning raw floats for embedding values.

I don't know what llm_class is doing, but that is likely the issue.

You can easily check with embeddings = embed_model.get_text_embedding("Hello world") and ensure the returned type is a list of floats. It seems like it might be returning numpy values or something else.
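The check above can be sketched with a stub standing in for the real embed_model (the stub and its output are assumptions; substitute your configured model):

```python
# Minimal sketch of the embedding-type check. StubEmbedModel is a
# placeholder assumption -- swap in your actual embed_model.
class StubEmbedModel:
    def get_text_embedding(self, text: str) -> list:
        return [0.1, 0.2, 0.3]

embed_model = StubEmbedModel()
embeddings = embed_model.get_text_embedding("Hello world")

# The index code expects a plain list of Python floats; anything else
# (e.g. a numpy array, or numpy scalars inside the list) is a red flag.
print(isinstance(embeddings, list))                   # True
print(all(isinstance(v, float) for v in embeddings))  # True
```

If the real model returns a numpy array, converting it with .tolist() yields plain Python floats before handing the values to the index.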

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 2, 2024