Find a way to avoid vector duplication #65

Dhravya · 2024-06-17T05:11:35Z

This code is mainly here

supermemory/apps/cf-ai-backend/src/helper.ts

Line 82 in 5af20f7

export async function batchCreateChunksAndEmbeddings({

Currently, we store a separate vector for each user, like this:
User 1 -> saves google.com -> first, the old data is deleted -> vector id is google.com/#supermemory-${userid} (which is then reduced to less than 63 bytes using seededRandom()) -> The page content is chunked and each chunk is saved to the KV with its own ID. A user: userID metadata field is added for retrieval.
User 2 -> saves google.com -> a duplicate is made, and the same steps follow.

Instead, we want to do it this way, so that there are no duplicates:
User 1 -> saves google.com -> vector id is google.com (which is reduced to less than 63 bytes using seededRandom()) -> Page content is chunked and each chunk is saved to the KV as vectorid-chunkid. THIS TIME, a metadata entry user_${userid}: 1 should be added.

User 2 -> saves google.com -> we use the Vectorize index to look up the existing chunk vectors by ID:
let ids = ["11", "22", "33", "44"]; // all chunk Ids, that we can get by doing list with prefix: seededRandom(url)
const vectors = await env.VECTORIZE_INDEX.getByIds(ids);
If found, simply call env.VECTORIZE_INDEX.upsert with the same vectors BUT this time with updated metadata (for each chunk), with user_${userid2}: 1 added to the JSON. We don't need to add the documents again. I think we don't even need to update the KV again (since the KV is just a lookup of our seededRandom value for the URL).
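
The lookup-then-upsert step for a second user could be sketched roughly as follows. This is only a sketch under assumptions: StoredVector and VectorIndexLike are minimal local stand-ins for the shape of the Vectorize binding (only getByIds and upsert, as named above), and withUser and attachUserToExistingChunks are hypothetical helper names, not functions from helper.ts.

```typescript
// Assumed minimal shape of a stored vector; not the official Vectorize types.
interface StoredVector {
  id: string;
  values: number[];
  metadata?: Record<string, string | number | boolean>;
}

// Assumed minimal view of the index binding: only the two calls this issue mentions.
interface VectorIndexLike {
  getByIds(ids: string[]): Promise<StoredVector[]>;
  upsert(vectors: StoredVector[]): Promise<unknown>;
}

// Hypothetical helper: copy a vector and add user_${userId}: 1 to its metadata,
// keeping every metadata key that was already there (including other users).
function withUser(vector: StoredVector, userId: string): StoredVector {
  return {
    ...vector,
    metadata: { ...(vector.metadata ?? {}), [`user_${userId}`]: 1 },
  };
}

// Hypothetical helper: re-tag existing chunk vectors for another user.
// Returns false when nothing was found, so the caller can fall back to
// chunking and embedding the page as before.
async function attachUserToExistingChunks(
  index: VectorIndexLike,
  chunkIds: string[],
  userId: string,
): Promise<boolean> {
  const existing = await index.getByIds(chunkIds);
  if (existing.length === 0) {
    return false;
  }
  // Same ids and values, only the metadata changes, so no new embeddings are needed.
  await index.upsert(existing.map((v) => withUser(v, userId)));
  return true;
}
```

Merging into the existing metadata rather than replacing it matters here: replacing it would silently drop the first user's user_${userid}: 1 entry.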
NOTE: we also need to change the space-adding logic by adding a new metadata key for each space, in the format space-userid-spaceid: 1.
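
To make the two key formats concrete, here is a small sketch of how the per-user and per-space metadata entries might be built. The helper names (userMetadataKey, spaceMetadataKey, buildMembershipMetadata) are hypothetical, and the exact separators are taken from the formats written above.

```typescript
// Hypothetical helper: key marking that a user has saved this page.
function userMetadataKey(userId: string): string {
  return `user_${userId}`;
}

// Hypothetical helper: key marking that the page is in one of a user's spaces.
function spaceMetadataKey(userId: string, spaceId: string): string {
  return `space-${userId}-${spaceId}`;
}

// Builds the metadata entries for one user and the spaces they saved the page to.
// The result is meant to be merged into each chunk's existing metadata.
function buildMembershipMetadata(
  userId: string,
  spaceIds: string[],
): Record<string, number> {
  const metadata: Record<string, number> = { [userMetadataKey(userId)]: 1 };
  for (const spaceId of spaceIds) {
    metadata[spaceMetadataKey(userId, spaceId)] = 1;
  }
  return metadata;
}

// Example: user "42" saves a page into spaces "7" and "9".
const exampleMembership = buildMembershipMetadata("42", ["7", "9"]);
// exampleMembership is { "user_42": 1, "space-42-7": 1, "space-42-9": 1 }
```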

We would also need to update the retrieval logic in /api/chat. It will mostly stay the same, except that the filters for spaceid and userid now use the new metadata keys.
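
A rough sketch of how the retrieval filter might change is shown below. buildChatFilter is a hypothetical helper, and the sketch assumes that filtering on the space key alone is enough when a space is selected (since the key already contains the user id); the real /api/chat handler may combine filters differently.

```typescript
// A flat equality filter on metadata keys (assumed shape).
type MetadataFilter = Record<string, number>;

// Hypothetical helper: choose the metadata key to filter on.
// With a space selected, match space-userid-spaceid: 1;
// otherwise match everything the user has saved via user_${userid}: 1.
function buildChatFilter(userId: string, spaceId?: string): MetadataFilter {
  if (spaceId !== undefined) {
    return { [`space-${userId}-${spaceId}`]: 1 };
  }
  return { [`user_${userId}`]: 1 };
}
```

The resulting object would then be passed as the filter option of the Vectorize query call in place of the old user: userID filter.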

Dhravya self-assigned this Jun 17, 2024

Dhravya mentioned this issue Jun 18, 2024

Vector deduplication #71

Merged

Dhravya closed this as completed by moving to Done in Supermemory roadmap Jun 18, 2024
