🌲 KD-Tree: Spatial Indexing for Efficient Data Retrieval

📰 Introduction

This project objective is to have Sematic Search Database System that can create indexing on large amount of data, besides it is able to search in the created index for query vectors in a fast and accurate method.

📋Project description

The project focuses on developing an indexing system that efficiently stores and retrieves data utilizing vector space embeddings. Key requirements include:

Two-column Database Structure:
- Columns: ID, Embedding (vector of dim 70).
Top-k Similarity Retrieval:
- Retrieve the top-k most similar rows to a given input.
Scalability:
- Handle large datasets, up to 20 million records.
Efficient Query Retrieval:
- Quickly retrieve the top 'k' results for a given query (k up to 10).
Responsive Performance:
- Ensure the system responds within a reasonable time.

👀 Overview on the Indexing idea with KD-tree and implementation

Build Operation:

Disk Usage:
- Large vector operations on disk due to memory constraints.
- Memory-constrained operations in-memory.
Clustering:
- Nodes clustered on disk.
- Global directory ("output") as the first level of indexing.
Algorithm:
- External sorting for large datasets.
- Recursive partitioning and sorting.
- Switch to internal build for in-memory partitions.
- Cluster creation at a size threshold.
- Nodes stored in both internal and leaf nodes.

Search Operation:

Global Directory Search:
- Binary search algorithm on the global directory.
- Multiple clusters may be required.
Binary Search-Like:
- Compares specific dimensions at each tree level.
- May result in both right and left nodes being applicable.
Cluster Loading:
- One cluster loaded at a time.
- Released when not needed.

General Notes:

Insertion:
- Inserted nodes find correct position in the cluster.
Thresholds:
- Cluster size threshold determined for memory efficiency.
Memory Management:
- Efficient loading of necessary clusters.

🚀 results

Evaluation note book Here

Reference

Project description in more details

contributors

mohamed Tarek
mohamed elsayed
salah abotaleb
moaz tarek
ahmed yasser

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
docs		docs
.gitignore		.gitignore
Team_1_kd_tree_Evaluation_Notebook.ipynb		Team_1_kd_tree_Evaluation_Notebook.ipynb
kdtree.py		kdtree.py
readme.md		readme.md
requirements.txt		requirements.txt
savingTest.py		savingTest.py
vec_db.py		vec_db.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🌲 KD-Tree: Spatial Indexing for Efficient Data Retrieval

📰 Introduction

📋Project description

👀 Overview on the Indexing idea with KD-tree and implementation

Build Operation:

Search Operation:

General Notes:

🚀 results

Reference

contributors

About

Releases

Packages

Contributors 2

Languages

mtheggi/sematic_search_DB

Folders and files

Latest commit

History

Repository files navigation

🌲 KD-Tree: Spatial Indexing for Efficient Data Retrieval

📰 Introduction

📋Project description

👀 Overview on the Indexing idea with KD-tree and implementation

Build Operation:

Search Operation:

General Notes:

🚀 results

Reference

contributors

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages