[Bug]: Metadata length (236) is close to chunk size (256). Resulting chunks are less than 50 tokens. Consider increasing the chunk size or decreasing the size of your metadata to avoid this. #13792
Comments
To resolve the warning about metadata length (236) being close to the chunk size (256) when ingesting text chunks into a Qdrant database, you should either reduce the size of the metadata or increase the chunk size. This prevents the explosion in the number of text chunks and the excessive ingestion time. Here is an example of how you might adjust the chunk size before ingestion:

```python
def split_text_into_chunks(text, chunk_size):
    """Split text into larger chunks of the specified size."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Example usage
text = "Your long text here..."
chunk_size = 1000  # Increase the chunk size as needed
text_chunks = split_text_into_chunks(text, chunk_size)

# Ingest each chunk into Qdrant
# (Document, generate_unique_id, and qdrant_client are assumed to be
# defined/imported elsewhere in your application.)
for chunk in text_chunks:
    document = Document(
        id_=generate_unique_id(),  # Ensure each document has a unique ID
        text=chunk,
        metadata={"your": "metadata"},  # Include only essential metadata
    )
    # Ingest the document into Qdrant
    qdrant_client.upload_documents(collection_name="your_collection", documents=[document])
```

Additionally, you can leverage the `QdrantVectorStore` class, which batches uploads for you. Here is an example of how to use it:

```python
from llama_index.core.schema import TextNode
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient()
vector_store = QdrantVectorStore(
    collection_name="your_collection",
    client=client,
    batch_size=64,  # Adjust batch size as needed
    parallel=1,     # Adjust parallel processes as needed
    max_retries=3,  # Adjust max retries as needed
)

# Example usage
text = "Your long text here..."
chunk_size = 1000  # Increase the chunk size as needed
text_chunks = split_text_into_chunks(text, chunk_size)
nodes = [
    TextNode(id_=generate_unique_id(), text=chunk, metadata={"your": "metadata"})
    for chunk in text_chunks
]

# Add the nodes to the vector store
vector_store.add(nodes)
```

Moreover, you can use the `TokenTextSplitter`, which splits text in a metadata-aware way. Here is an example of how to use it:

```python
from llama_index.core.node_parser.text.token import TokenTextSplitter

# Initialize the TokenTextSplitter with an appropriate chunk size and overlap
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)

# Example usage
text = "Your long text here..."
metadata_str = "Your metadata here..."
text_chunks = splitter.split_text_metadata_aware(text, metadata_str)

# Ingest each chunk into Qdrant
for chunk in text_chunks:
    document = Document(
        id_=generate_unique_id(),  # Ensure each document has a unique ID
        text=chunk,
        metadata={"your": "metadata"},  # Include only essential metadata
    )
    # Ingest the document into Qdrant
    qdrant_client.upload_documents(collection_name="your_collection", documents=[document])
```

By reducing the size of each text chunk, minimizing the metadata, and using the batching and metadata-aware splitting options above, you should be able to avoid the warning and keep ingestion time manageable.
Either decrease the amount of text you are putting into metadata,
or set the metadata excludes on your input nodes/documents.
It's taking forever because it's probably creating a ton of nodes 😓
Bug Description
When ingesting the text chunks into a locally hosted Qdrant DB, this warning message is thrown and the number of text chunks obtained increases exponentially. As a result, it is taking forever to ingest the data into the vector DB.
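For intuition, the arithmetic behind the warning can be sketched in plain Python (the token counts below are illustrative, not measured from the real splitter): the metadata string is prepended to every chunk, so the budget left for actual text is roughly the chunk size minus the metadata length, and the chunk count blows up accordingly.

```python
import math

chunk_size = 256      # configured chunk size (tokens)
metadata_len = 236    # metadata injected into every chunk (tokens)
text_tokens = 10_000  # illustrative document length (tokens)

# Budget left for real text in each chunk
effective_budget = chunk_size - metadata_len  # only 20 tokens per chunk
n_chunks_small = math.ceil(text_tokens / effective_budget)

# The same document with a larger chunk size
n_chunks_large = math.ceil(text_tokens / (1000 - metadata_len))

print(n_chunks_small)  # 500 chunks of ~20 tokens of real text each
print(n_chunks_large)  # 14 chunks
```

This is why bumping the chunk size (or shrinking the metadata) cuts both the chunk count and the ingestion time so dramatically.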
Version
0.9.19
Steps to Reproduce
Create text chunks with a 256-token chunk size; ingesting this data into the vector DB throws this message.
Relevant Logs/Tracebacks
No response