Find a way to avoid vector duplication #65

Closed
Dhravya opened this issue Jun 17, 2024 · 0 comments

Dhravya commented Jun 17, 2024

This code is mainly here

export async function batchCreateChunksAndEmbeddings({

Currently, we store a separate set of vectors for each user, like this:
User 1 -> saves google.com -> first, the existing data for that page is deleted -> vector id is google.com/#supermemory-${userid} (which is then reduced to less than 63 bytes using seededRandom()) -> the page content is chunked and each chunk is saved with its own ID to the KV. A user: userID metadata field is added for retrieval.
User 2 -> saves google.com -> a duplicate is created, and the same steps follow.

Instead, we want to do it this way, so that there are no duplicates:
User 1 -> saves google.com -> vector id is google.com (which is reduced to less than 63 bytes using seededRandom()) -> the page content is chunked and each chunk is saved to the KV under vectorid-chunkid. THIS TIME, a metadata field user_${userid}: 1 should be added.
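A minimal sketch of the shared ID scheme described above. `shortId` is a hypothetical stand-in for the project's `seededRandom()` (its real signature is not shown in this issue), and `buildSharedChunkRecords` and `ChunkRecord` are illustrative names, not existing code:

```typescript
// Hypothetical stand-in for the project's seededRandom(): a deterministic
// FNV-1a hash of the URL, rendered as 8 hex characters (well under 63 bytes).
function shortId(url: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < url.length; i++) {
    hash ^= url.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

interface ChunkRecord {
  id: string; // `${vectorId}-${chunkIndex}`, no user ID in it
  metadata: Record<string, string | number>;
}

// Builds one record per chunk. The ID depends only on the URL, so every user
// who saves the same page maps to the same chunk IDs; ownership lives in the
// metadata as user_${userId}: 1.
function buildSharedChunkRecords(
  url: string,
  chunks: string[],
  userId: string,
): ChunkRecord[] {
  const vectorId = shortId(url);
  return chunks.map((_chunk, chunkIndex) => ({
    id: `${vectorId}-${chunkIndex}`,
    metadata: { [`user_${userId}`]: 1 },
  }));
}
```

Because the user ID is no longer part of the ID, two users saving google.com produce identical chunk IDs and differ only in their metadata keys.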

User 2 -> saves google.com -> we use the Vectorize index to look up the existing chunks by ID:
let ids = ["11", "22", "33", "44"]; // all chunk Ids, that we can get by doing list with prefix: seededRandom(url)
const vectors = await env.VECTORIZE_INDEX.getByIds(ids);
If found, simply call env.VECTORIZE_INDEX.upsert with the same vectors BUT this time with updated metadata (for each chunk), with user_${userid2}: 1 added to the JSON. We don't need to add the documents again. I think we don't even need to update the KV again (since the KV is just a lookup of our seededRandom for the URL).
NOTE: we also need to change the space-adding logic by adding a new key for each space, in the format space-userid-spaceid: 1.
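The second-user flow above could look roughly like this. `StoredVector` and `VectorIndexLike` are simplified stand-ins for the Vectorize binding types (only `getByIds` and `upsert` are used), and `withUserAccess` and `shareExistingChunks` are hypothetical helper names:

```typescript
// Simplified view of a stored vector; the real binding types are broader.
interface StoredVector {
  id: string;
  values: number[];
  metadata?: Record<string, string | number | boolean>;
}

// Minimal structural view of the two index calls this flow needs.
interface VectorIndexLike {
  getByIds(ids: string[]): Promise<StoredVector[]>;
  upsert(vectors: StoredVector[]): Promise<unknown>;
}

// Returns copies of the vectors with user_${userId}: 1 added, plus one
// space-${userId}-${spaceId}: 1 key per space. Existing keys are kept.
function withUserAccess(
  vectors: StoredVector[],
  userId: string,
  spaceIds: string[] = [],
): StoredVector[] {
  return vectors.map((vector) => {
    const metadata: Record<string, string | number | boolean> = {
      ...(vector.metadata ?? {}),
      [`user_${userId}`]: 1,
    };
    for (const spaceId of spaceIds) {
      metadata[`space-${userId}-${spaceId}`] = 1;
    }
    return { ...vector, metadata };
  });
}

// Returns true if the page was already indexed and only metadata was updated;
// false means the caller still has to chunk and embed the page.
async function shareExistingChunks(
  index: VectorIndexLike,
  chunkIds: string[],
  userId: string,
  spaceIds: string[] = [],
): Promise<boolean> {
  const existing = await index.getByIds(chunkIds);
  if (existing.length === 0) return false;
  await index.upsert(withUserAccess(existing, userId, spaceIds));
  return true;
}
```

Keeping the metadata merge in a pure function makes it easy to check that an earlier user's keys survive the upsert.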

We would also need to update the retrieval logic in /api/chat. It will mostly stay the same, except that the filter must now match the new user_${userid} and space-userid-spaceid keys instead of the old user field.
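As a sketch of what the changed filter might look like, assuming the metadata keys proposed above. `buildChatFilter` is a hypothetical helper; the equality-style filter object is the shape passed as `filter` in the options of a Vectorize `query` call, but whether the index can filter on such dynamically named keys depends on its metadata configuration, which is not verified here:

```typescript
type MetadataFilter = Record<string, string | number | boolean>;

// Builds the filter for /api/chat retrieval: always restrict to chunks the
// user has saved, and additionally to one space when a spaceId is given.
function buildChatFilter(userId: string, spaceId?: string): MetadataFilter {
  const filter: MetadataFilter = { [`user_${userId}`]: 1 };
  if (spaceId !== undefined) {
    filter[`space-${userId}-${spaceId}`] = 1;
  }
  return filter;
}

// Intended use (not executed here; queryVector is the embedded question):
//   const matches = await env.VECTORIZE_INDEX.query(queryVector, {
//     topK: 5,
//     filter: buildChatFilter(userId, spaceId),
//   });
```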

@Dhravya Dhravya self-assigned this Jun 17, 2024
@Dhravya Dhravya closed this as completed by moving to Done in Supermemory roadmap Jun 18, 2024