Support vector ANN search benchmarking #3094
base: bb-11.4-vec-vicentiu
Test results
Example for a local run with
Introduce scripts and a Dockerfile for executing the `ann-benchmarks` tool, aimed at vector search performance testing. Support running the ANN benchmarking both in GitLab CI and manually.

Developer Interface:
Both scripts allow their default behavior to be altered via environment variables. Refer to the detailed description in the scripts' documentation section.
- `run-local.sh`: This script runs the ANN (Approximate Nearest Neighbors) benchmarking test either against local builds or against a specified folder where the MariaDB server is installed.
- `run-docker.sh`: This script automates the execution of the ANN benchmarking test within a Docker container. It builds the required Docker image if it doesn't exist or if forced, then builds the source code and runs the benchmark in the specified workspace.

GitLab CI Build:
- A new job `ann-benchmark` is included in the test stage. This job runs ann-benchmark against the MariaDB server built on Ubuntu 22.04. Initially, we are using the `random-xs-20-euclidean` dataset with 20 dimensions and 10000 records.
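As a rough illustration of the environment-variable interface described above, here is a minimal Python sketch. Only the `PERF` variable and its `yes`/`no` convention come from this PR (see `prepare_options` in `module.py`). The `DATASET` variable name and the helpers `env_flag` and `load_settings` are hypothetical and serve only to show the pattern.

```python
import os


def env_flag(name, default="no", environ=None):
    """Return True only when the variable equals 'yes' (the PERF convention in module.py)."""
    env = os.environ if environ is None else environ
    return env.get(name, default) == "yes"


def load_settings(environ=None):
    """Collect benchmark settings from the environment, falling back to defaults."""
    env = os.environ if environ is None else environ
    return {
        # PERF appears in the PR; it enables perf recording when set to 'yes'.
        "perf": env_flag("PERF", environ=env),
        # DATASET is a hypothetical variable name used for illustration only.
        "dataset": env.get("DATASET", "random-xs-20-euclidean"),
    }
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise in isolation, for example `load_settings({"PERF": "yes"})`.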
@HugoWenTD, could you add support for
diff --git a/ann_benchmarks/algorithms/mariadb/module.py b/ann_benchmarks/algorithms/mariadb/module.py
index 382ea70..89efce1 100644
--- a/ann_benchmarks/algorithms/mariadb/module.py
+++ b/ann_benchmarks/algorithms/mariadb/module.py
@@ -8,6 +8,7 @@ import subprocess
import sys
import tempfile
import time
+import threading
import mariadb
@@ -25,7 +26,7 @@ class MariaDB(BaseANN):
self._test_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())
self._metric = metric
self._m = method_param['M']
- self._cur = None
+ self._ = threading.local()
self._perf_proc = None
self._perf_records = []
self._perf_stats = []
@@ -45,7 +46,7 @@ class MariaDB(BaseANN):
# Connect to MariaDB using Unix socket
conn = mariadb.connect(unix_socket=self._socket_file)
- self._cur = conn.cursor()
+ self._.cur = conn.cursor()
def prepare_options(self):
self._perf_stat = os.environ.get('PERF', 'no') == 'yes' and MariaDB.can_run_perf()
@@ -247,15 +248,15 @@ class MariaDB(BaseANN):
def fit(self, X):
# Prepare database and table
print("\nPreparing database and table...")
- self._cur.execute("DROP DATABASE IF EXISTS ann")
- self._cur.execute("CREATE DATABASE ann")
- self._cur.execute("USE ann")
- self._cur.execute("SET mhnsw_max_edges_per_node = %d" % self._m)
- self._cur.execute("SET rand_seed1=1, rand_seed2=2")
+ self._.cur.execute("DROP DATABASE IF EXISTS ann")
+ self._.cur.execute("CREATE DATABASE ann")
+ self._.cur.execute("USE ann")
+ self._.cur.execute("SET mhnsw_max_edges_per_node = %d" % self._m)
+ self._.cur.execute("SET rand_seed1=1, rand_seed2=2")
# Innodb create table with index is not supported with the latest commit of the develop branch.
# Once all supported we could use:
- #self._cur.execute("CREATE TABLE t1 (id INT PRIMARY KEY, v BLOB NOT NULL, vector INDEX (v)) ENGINE=InnoDB;")
- self._cur.execute("CREATE TABLE t1 (id INT PRIMARY KEY, v BLOB NOT NULL, vector INDEX (v)) ENGINE=MyISAM;")
+ #self._.cur.execute("CREATE TABLE t1 (id INT PRIMARY KEY, v BLOB NOT NULL, vector INDEX (v)) ENGINE=InnoDB;")
+ self._.cur.execute("CREATE TABLE t1 (id INT PRIMARY KEY, v BLOB NOT NULL, vector INDEX (v)) ENGINE=MyISAM;")
# Insert data
print("\nInserting data...")
@@ -263,11 +264,11 @@ class MariaDB(BaseANN):
start_time = time.time()
rps = 10000
for i, embedding in enumerate(X):
- self._cur.execute("INSERT INTO t1 (id, v) VALUES (%d, %s)", (i, bytes(vector_to_hex(embedding))))
+ self._.cur.execute("INSERT INTO t1 (id, v) VALUES (%d, %s)", (i, bytes(vector_to_hex(embedding))))
if i % int(rps + 1) == 1:
rps=i/(time.time()-start_time)
print(f"{i:6d} of {len(X)}, {rps:4.2f} stmt/sec, ETA {(len(X)-i)/rps:.0f} sec")
- self._cur.execute("commit")
+ self._.cur.execute("commit")
self.perf_stop()
print(f"\nInsert time for {X.size} records: {time.time() - start_time:7.2f}")
@@ -280,7 +281,7 @@ class MariaDB(BaseANN):
elif self._metric == "euclidean":
# The feature is being developed
# Currently stack will be empty for indexing in perf data as nothing is executed
- #self._cur.execute("ALTER TABLE `t1` ADD VECTOR INDEX (v);")
+ #self._.cur.execute("ALTER TABLE `t1` ADD VECTOR INDEX (v);")
pass
else:
pass
@@ -292,25 +293,32 @@ class MariaDB(BaseANN):
def set_query_arguments(self, ef_search):
# Set ef_search
self._ef_search = ef_search
- self._cur.execute("SET mhnsw_limit_multiplier = %d/10" % ef_search)
+ self._.cur.execute("SET mhnsw_limit_multiplier = %d/10" % ef_search)
def query(self, v, n):
- self._cur.execute("SELECT id FROM t1 ORDER by vec_distance(v, %s) LIMIT %d", (bytes(vector_to_hex(v)), n))
- return [id for id, in self._cur.fetchall()]
+ if not hasattr(self._, 'cur'):
+ conn = mariadb.connect(unix_socket=self._socket_file)
+ self._.cur = conn.cursor()
+ self._.cur.execute("USE ann")
+ self._.cur.execute("SET mhnsw_limit_multiplier = %d/10" % self._ef_search)
+ self._.cur.execute("SET rand_seed1=13, rand_seed2=29")
+
+ self._.cur.execute("SELECT id FROM t1 ORDER by vec_distance(v, %s) LIMIT %d", (bytes(vector_to_hex(v)), n))
+ return [id for id, in self._.cur.fetchall()]
# TODO for MariaDB, get the memory usage when index is supported:
# def get_memory_usage(self):
- # if self._cur is None:
+ # if self._.cur is None:
# return 0
- # self._cur.execute("")
- # return self._cur.fetchone()[0] / 1024
+ # self._.cur.execute("")
+ # return self._.cur.fetchone()[0] / 1024
def __str__(self):
return f"MariaDB(m={self._m:2d}, ef_search={self._ef_search})"
def done(self):
# Shutdown MariaDB server when benchmarking done
- self._cur.execute("shutdown")
+ self._.cur.execute("shutdown")
# Stop perf for searching and do final analysis
self.perf_stop()
self.perf_analysis()
that works, but if you run
I've managed to make it work with
@vuvova I think this might be related to the ann-benchmarks framework. I'll investigate it further once I have time free from other tasks.
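For readers following the diff above, the per-thread cursor idea can be sketched in isolation. This is a minimal illustration, not the PR's code: `FakeConnection`, `FakeCursor`, `PerThreadCursor`, and `run_demo` are hypothetical stand-ins so the example runs without a MariaDB server. The point it shows is that each thread lazily opens its own connection on first use and must replay session-level statements (such as `USE ann`), because session state is not shared between connections.

```python
import threading


class FakeCursor:
    """Stand-in for a database cursor; it only records the statements it receives."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


class FakeConnection:
    """Stand-in for a database connection (hypothetical, for illustration)."""

    def cursor(self):
        return FakeCursor()


class PerThreadCursor:
    """Lazily create one connection and cursor per thread using threading.local."""

    def __init__(self, connect, session_setup=()):
        self._connect = connect                    # factory that returns a new connection
        self._session_setup = tuple(session_setup)  # statements replayed on each new connection
        self._local = threading.local()

    def get(self):
        if not hasattr(self._local, "cur"):
            conn = self._connect()
            self._local.conn = conn  # keep a reference so the connection stays open
            cur = conn.cursor()
            for stmt in self._session_setup:
                cur.execute(stmt)
            self._local.cur = cur
        return self._local.cur


def run_demo(n_threads=4):
    """Start n_threads workers and return the (first, second) cursors each one saw."""
    holder = PerThreadCursor(FakeConnection, ["USE ann"])
    seen = []
    lock = threading.Lock()

    def worker():
        first = holder.get()
        second = holder.get()  # same thread, so the cached cursor is returned
        with lock:
            seen.append((first, second))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Within one thread, repeated calls to `get()` return the same cursor, while different threads never share one.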
Description
Introduce scripts and a Dockerfile for running the `ann-benchmarks` tool, dedicated to vector search performance testing.
Offer developers support to run the benchmark in their development environment, either against existing MariaDB builds or by deploying the source code and executing the benchmark within Docker.
Also, integrate these builds into GitLab CI for Ubuntu 22.04 and include ANN benchmarking tests.
For detailed usage instructions, refer to the commit message and script help command.
How can this PR be tested?
The scripts were tested manually. They are also integrated into the GitLab CI pipeline.
Basing the PR against the correct MariaDB version
Backward compatibility
The changes are fully backward compatible.
Copyright
All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.