
Support vector ANN search benchmarking #3094

Draft
wants to merge 1 commit into bb-11.4-vec-vicentiu from bb-11.4-vec-ann-benchmark

Conversation

HugoWenTD
Contributor

Description

Introduce scripts and a Dockerfile for running the ann-benchmarks tool, dedicated to vector search performance testing.

  • Allow developers to run the benchmark in their development environment against existing MariaDB builds, or to build the source and run the benchmark inside Docker.

  • Integrate the benchmark into GitLab CI for Ubuntu 22.04, including ANN benchmarking tests.

For detailed usage instructions, refer to the commit message and script help command.

How can this PR be tested?

The scripts were tested manually. They are also integrated into the GitLab CI pipeline.

Basing the PR against the correct MariaDB version

  • This is a new feature, so the PR is based on the latest MariaDB development branch.

Backward compatibility

The changes are fully backward compatible.

Copyright

All new code of the whole pull request, including one or several files that are either new files or modified ones, is contributed under the BSD-new license. I am contributing on behalf of my employer, Amazon Web Services, Inc.

@CLAassistant

CLAassistant commented Mar 1, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@HugoWenTD
Contributor Author

HugoWenTD commented Mar 1, 2024

Test results

Example for a local run with ./support-files/ann-benchmark/run-local.sh:

Click to expand
wenhug@ud83c070d9ea75a:~/workspace/server$ ./support-files/ann-benchmark/run-local.sh
Downloading ann-benchmark...

Cloning into '/home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/ann-benchmarks'...
remote: Enumerating objects: 237, done.
remote: Counting objects: 100% (237/237), done.
remote: Compressing objects: 100% (214/214), done.
remote: Total 237 (delta 23), reused 152 (delta 18), pack-reused 0
Receiving objects: 100% (237/237), 1.60 MiB | 9.34 MiB/s, done.
Resolving deltas: 100% (23/23), done.
Installing ann-benchmark dependencies...

Starting ann-benchmark...

downloading https://ann-benchmarks.com/random-xs-20-euclidean.hdf5 -> data/random-xs-20-euclidean.hdf5...
Cannot download https://ann-benchmarks.com/random-xs-20-euclidean.hdf5
Creating dataset locally
Splitting 10000*None into train/test
train size: 9000 * 20
test size:  1000 * 20
0/1000...
2024-03-18 11:17:30,522 - annb - INFO - running only mariadb
2024-03-18 11:17:30,526 - annb - INFO - Order: [Definition(algorithm='mariadb', constructor='MariaDB', module='ann_benchmarks.algorithms.mariadb', docker_tag='ann-benchmarks-mariadb', arguments=['euclidean', {'M': 24, 'efConstruction': 200}], query_argument_groups=[[10], [20], [40], [80], [120], [200], [400], [800]], disabled=False), Definition(algorithm='mariadb', constructor='MariaDB', module='ann_benchmarks.algorithms.mariadb', docker_tag='ann-benchmarks-mariadb', arguments=['euclidean', {'M': 16, 'efConstruction': 200}], query_argument_groups=[[10], [20], [40], [80], [120], [200], [400], [800]], disabled=False)]
Trying to instantiate ann_benchmarks.algorithms.mariadb.MariaDB(['euclidean', {'M': 24, 'efConstruction': 200}])

Setup paths:
MARIADB_ROOT_DIR: /home/ANT.AMAZON.COM/wenhug/workspace/server/builddir
DATA_DIR: /home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/data
LOG_FILE: /home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/mariadb.err
SOCKET_FILE: /tmp/mysql_4gl2e5ms.sock


Initialize MariaDB database...
/home/ANT.AMAZON.COM/wenhug/workspace/server/builddir/*/mariadb-install-db --no-defaults --verbose --skip-name-resolve --skip-test-db --datadir=/home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/data --srcdir=/home/ANT.AMAZON.COM/wenhug/workspace/server/support-files/ann-benchmark/../..
mysql.user table already exists!
Run mariadb-upgrade, not mariadb-install-db

Starting MariaDB server...
/home/ANT.AMAZON.COM/wenhug/workspace/server/builddir/*/mariadbd --no-defaults --datadir=/home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/data --log_error=/home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/mariadb.err --socket=/tmp/mysql_4gl2e5ms.sock --skip_networking --skip_grant_tables  &

MariaDB server started!
Got a train set of size (9000 * 20)
Got 1000 queries

Preparing database and table...

Inserting data...

Insert time for 180000 records: 0.4894428253173828

Creating index...

Index creation time: 9.5367431640625e-07
Built index in 0.5406086444854736
Index size:  128.0
Running query argument group 1 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 2 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 3 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 4 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 5 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 6 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 7 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 8 of 8...
Run 1/1...
Processed 1000/1000 queries...
Trying to instantiate ann_benchmarks.algorithms.mariadb.MariaDB(['euclidean', {'M': 16, 'efConstruction': 200}])

Setup paths:
MARIADB_ROOT_DIR: /home/ANT.AMAZON.COM/wenhug/workspace/server/builddir
DATA_DIR: /home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/data
LOG_FILE: /home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/mariadb.err
SOCKET_FILE: /tmp/mysql_q1gbgaf3.sock


Initialize MariaDB database...
/home/ANT.AMAZON.COM/wenhug/workspace/server/builddir/*/mariadb-install-db --no-defaults --verbose --skip-name-resolve --skip-test-db --datadir=/home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/data --srcdir=/home/ANT.AMAZON.COM/wenhug/workspace/server/support-files/ann-benchmark/../..
mysql.user table already exists!
Run mariadb-upgrade, not mariadb-install-db

Starting MariaDB server...
/home/ANT.AMAZON.COM/wenhug/workspace/server/builddir/*/mariadbd --no-defaults --datadir=/home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/data --log_error=/home/ANT.AMAZON.COM/wenhug/workspace/server/ann-workspace/mariadb-workspace/mariadb.err --socket=/tmp/mysql_q1gbgaf3.sock --skip_networking --skip_grant_tables  &

MariaDB server started!
Got a train set of size (9000 * 20)
Got 1000 queries

Preparing database and table...

Inserting data...

Insert time for 180000 records: 0.4275703430175781

Creating index...

Index creation time: 1.1920928955078125e-06
Built index in 0.48961424827575684
Index size:  0.0
Running query argument group 1 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 2 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 3 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 4 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 5 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 6 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 7 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 8 of 8...
Run 1/1...
Processed 1000/1000 queries...
2024-03-18 11:17:57,147 - annb - INFO - Terminating 1 workers

Ann-benchmark exporting data...

Looking at dataset deep-image-96-angular
Looking at dataset fashion-mnist-784-euclidean
Looking at dataset gist-960-euclidean
Looking at dataset glove-25-angular
Looking at dataset glove-50-angular
Looking at dataset glove-100-angular
Looking at dataset glove-200-angular
Looking at dataset mnist-784-euclidean
Looking at dataset random-xs-20-euclidean
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Looking at dataset random-s-100-euclidean
Looking at dataset random-xs-20-angular
Looking at dataset random-s-100-angular
Looking at dataset random-xs-16-hamming
Looking at dataset random-s-128-hamming
Looking at dataset random-l-256-hamming
Looking at dataset random-s-jaccard
Looking at dataset random-l-jaccard
Looking at dataset sift-128-euclidean
Looking at dataset nytimes-256-angular
Looking at dataset nytimes-16-angular
Looking at dataset word2bits-800-hamming
Looking at dataset lastfm-64-dot
Looking at dataset sift-256-hamming
Looking at dataset kosarak-jaccard
Looking at dataset movielens1m-jaccard
Looking at dataset movielens10m-jaccard
Looking at dataset movielens20m-jaccard
Looking at dataset dbpedia-openai-100k-angular
Looking at dataset dbpedia-openai-200k-angular
Looking at dataset dbpedia-openai-300k-angular
Looking at dataset dbpedia-openai-400k-angular
Looking at dataset dbpedia-openai-500k-angular
Looking at dataset dbpedia-openai-600k-angular
Looking at dataset dbpedia-openai-700k-angular
Looking at dataset dbpedia-openai-800k-angular
Looking at dataset dbpedia-openai-900k-angular
Looking at dataset dbpedia-openai-1000k-angular

Ann-benchmark plotting...

writing output to results/random-xs-20-euclidean.png
Found cached result
  0:                                 MariaDB(m=16, ef_construction=200, ef_search=40)        1.000     1007.832
Found cached result
  1:                                MariaDB(m=24, ef_construction=200, ef_search=400)        1.000      941.649
Found cached result
  2:                                 MariaDB(m=24, ef_construction=200, ef_search=10)        1.000     1140.663
Found cached result
  3:                                 MariaDB(m=24, ef_construction=200, ef_search=20)        1.000      988.373
Found cached result
  4:                                MariaDB(m=24, ef_construction=200, ef_search=120)        1.000     1091.114
Found cached result
  5:                                MariaDB(m=16, ef_construction=200, ef_search=120)        1.000      998.908
Found cached result
  6:                                 MariaDB(m=16, ef_construction=200, ef_search=20)        1.000     1021.691
Found cached result
  7:                                 MariaDB(m=24, ef_construction=200, ef_search=40)        1.000      823.179
Found cached result
  8:                                MariaDB(m=16, ef_construction=200, ef_search=400)        1.000     1079.078
Found cached result
  9:                                MariaDB(m=24, ef_construction=200, ef_search=800)        1.000     1218.009
Found cached result
 10:                                MariaDB(m=24, ef_construction=200, ef_search=200)        1.000      870.886
Found cached result
 11:                                MariaDB(m=16, ef_construction=200, ef_search=800)        1.000     1058.689
Found cached result
 12:                                 MariaDB(m=24, ef_construction=200, ef_search=80)        1.000      851.237
Found cached result
 13:                                MariaDB(m=16, ef_construction=200, ef_search=200)        1.000      930.801
Found cached result
 14:                                 MariaDB(m=16, ef_construction=200, ef_search=80)        1.000     1208.318
Found cached result
 15:                                 MariaDB(m=16, ef_construction=200, ef_search=10)        1.000      913.258

Ann-benchmark plot done; the last two columns in the output above are 'recall rate' and 'QPS'. ^^^


[COMPLETED]


Example for a local run with ./support-files/ann-benchmark/run-docker.sh (when doing an incremental build):

Click to expand
wenhug@ud83c070d9ea75a:~/workspace/server$ ./support-files/ann-benchmark/run-docker.sh
Docker image found.
-- Running cmake version 3.22.1
-- MariaDB 11.4.0
-- Updating submodules
-- Could NOT find PkgConfig (missing: PKG_CONFIG_EXECUTABLE) 
== Configuring MariaDB Connector/C
-- SYSTEM_LIBS: /usr/lib/x86_64-linux-gnu/libz.so;dl;m;dl;m;/usr/lib/x86_64-linux-gnu/libssl.so;/usr/lib/x86_64-linux-gnu/libcrypto.so;/usr/lib/x86_64-linux-gnu/libz.so
-- Configuring OQGraph
-- Configuring done
-- Generating done
-- Build files have been written to: /build/ann-workspace/builddir
[13/13] Linking CXX executable extra/mariabackup/mariadb-backup
Downloading ann-benchmark...

[WARN] ann-benchmarks repository already exists. Skipping cloning. Remove /build/server/ann-workspace/ann-benchmarks if you want it to be re-initialized.

Installing ann-benchmark dependencies...

WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Starting ann-benchmark...

2024-03-18 18:18:54,384 - annb - INFO - running only mariadb
2024-03-18 18:18:54,393 - annb - INFO - Order: [Definition(algorithm='mariadb', constructor='MariaDB', module='ann_benchmarks.algorithms.mariadb', docker_tag='ann-benchmarks-mariadb', arguments=['euclidean', {'M': 16, 'efConstruction': 200}], query_argument_groups=[[10], [20], [40], [80], [120], [200], [400], [800]], disabled=False), Definition(algorithm='mariadb', constructor='MariaDB', module='ann_benchmarks.algorithms.mariadb', docker_tag='ann-benchmarks-mariadb', arguments=['euclidean', {'M': 24, 'efConstruction': 200}], query_argument_groups=[[10], [20], [40], [80], [120], [200], [400], [800]], disabled=False)]
Trying to instantiate ann_benchmarks.algorithms.mariadb.MariaDB(['euclidean', {'M': 16, 'efConstruction': 200}])

Setup paths:
MARIADB_ROOT_DIR: /build/ann-workspace/builddir
DATA_DIR: /build/server/ann-workspace/mariadb-workspace/data
LOG_FILE: /build/server/ann-workspace/mariadb-workspace/mariadb.err
SOCKET_FILE: /tmp/mysql_4yk6c666.sock

Could not get current user, could be docker user mapping. Ignore.

Initialize MariaDB database...
/build/ann-workspace/builddir/*/mariadb-install-db --no-defaults --verbose --skip-name-resolve --skip-test-db --datadir=/build/server/ann-workspace/mariadb-workspace/data --srcdir=/build/server/support-files/ann-benchmark/../..
mysql.user table already exists!
Run mariadb-upgrade, not mariadb-install-db

Starting MariaDB server...
/build/ann-workspace/builddir/*/mariadbd --no-defaults --datadir=/build/server/ann-workspace/mariadb-workspace/data --log_error=/build/server/ann-workspace/mariadb-workspace/mariadb.err --socket=/tmp/mysql_4yk6c666.sock --skip_networking --skip_grant_tables  &

MariaDB server started!
Got a train set of size (9000 * 20)
Got 1000 queries

Preparing database and table...

Inserting data...

Insert time for 180000 records: 0.43891072273254395

Creating index...

Index creation time: 1.1920928955078125e-06
Built index in 0.4922800064086914
Index size:  128.0
Running query argument group 1 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 2 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 3 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 4 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 5 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 6 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 7 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 8 of 8...
Run 1/1...
Processed 1000/1000 queries...
Trying to instantiate ann_benchmarks.algorithms.mariadb.MariaDB(['euclidean', {'M': 24, 'efConstruction': 200}])

Setup paths:
MARIADB_ROOT_DIR: /build/ann-workspace/builddir
DATA_DIR: /build/server/ann-workspace/mariadb-workspace/data
LOG_FILE: /build/server/ann-workspace/mariadb-workspace/mariadb.err
SOCKET_FILE: /tmp/mysql_renlus59.sock

Could not get current user, could be docker user mapping. Ignore.

Initialize MariaDB database...
/build/ann-workspace/builddir/*/mariadb-install-db --no-defaults --verbose --skip-name-resolve --skip-test-db --datadir=/build/server/ann-workspace/mariadb-workspace/data --srcdir=/build/server/support-files/ann-benchmark/../..
mysql.user table already exists!
Run mariadb-upgrade, not mariadb-install-db

Starting MariaDB server...
/build/ann-workspace/builddir/*/mariadbd --no-defaults --datadir=/build/server/ann-workspace/mariadb-workspace/data --log_error=/build/server/ann-workspace/mariadb-workspace/mariadb.err --socket=/tmp/mysql_renlus59.sock --skip_networking --skip_grant_tables  &

MariaDB server started!
Got a train set of size (9000 * 20)
Got 1000 queries

Preparing database and table...

Inserting data...

Insert time for 180000 records: 0.3983802795410156

Creating index...

Index creation time: 1.1920928955078125e-06
Built index in 0.4507639408111572
Index size:  0.0
Running query argument group 1 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 2 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 3 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 4 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 5 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 6 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 7 of 8...
Run 1/1...
Processed 1000/1000 queries...
Running query argument group 8 of 8...
Run 1/1...
Processed 1000/1000 queries...
2024-03-18 18:19:22,024 - annb - INFO - Terminating 1 workers

Ann-benchmark exporting data...

Looking at dataset deep-image-96-angular
Looking at dataset fashion-mnist-784-euclidean
Looking at dataset gist-960-euclidean
Looking at dataset glove-25-angular
Looking at dataset glove-50-angular
Looking at dataset glove-100-angular
Looking at dataset glove-200-angular
Looking at dataset mnist-784-euclidean
Looking at dataset random-xs-20-euclidean
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Computing knn metrics
Computing epsilon metrics
Computing epsilon metrics
Computing rel metrics
Looking at dataset random-s-100-euclidean
Looking at dataset random-xs-20-angular
Looking at dataset random-s-100-angular
Looking at dataset random-xs-16-hamming
Looking at dataset random-s-128-hamming
Looking at dataset random-l-256-hamming
Looking at dataset random-s-jaccard
Looking at dataset random-l-jaccard
Looking at dataset sift-128-euclidean
Looking at dataset nytimes-256-angular
Looking at dataset nytimes-16-angular
Looking at dataset word2bits-800-hamming
Looking at dataset lastfm-64-dot
Looking at dataset sift-256-hamming
Looking at dataset kosarak-jaccard
Looking at dataset movielens1m-jaccard
Looking at dataset movielens10m-jaccard
Looking at dataset movielens20m-jaccard
Looking at dataset dbpedia-openai-100k-angular
Looking at dataset dbpedia-openai-200k-angular
Looking at dataset dbpedia-openai-300k-angular
Looking at dataset dbpedia-openai-400k-angular
Looking at dataset dbpedia-openai-500k-angular
Looking at dataset dbpedia-openai-600k-angular
Looking at dataset dbpedia-openai-700k-angular
Looking at dataset dbpedia-openai-800k-angular
Looking at dataset dbpedia-openai-900k-angular
Looking at dataset dbpedia-openai-1000k-angular

Ann-benchmark plotting...

Matplotlib created a temporary config/cache directory at /tmp/matplotlib-tuav14cy because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
writing output to results/random-xs-20-euclidean.png
Found cached result
  0:                                 MariaDB(m=16, ef_construction=200, ef_search=40)        1.000      841.485
Found cached result
  1:                                MariaDB(m=24, ef_construction=200, ef_search=400)        1.000      896.407
Found cached result
  2:                                 MariaDB(m=24, ef_construction=200, ef_search=10)        1.000      827.326
Found cached result
  3:                                 MariaDB(m=24, ef_construction=200, ef_search=20)        1.000      875.636
Found cached result
  4:                                MariaDB(m=24, ef_construction=200, ef_search=120)        1.000      877.246
Found cached result
  5:                                MariaDB(m=16, ef_construction=200, ef_search=120)        1.000      843.912
Found cached result
  6:                                 MariaDB(m=16, ef_construction=200, ef_search=20)        1.000      844.746
Found cached result
  7:                                 MariaDB(m=24, ef_construction=200, ef_search=40)        1.000     1006.725
Found cached result
  8:                                MariaDB(m=16, ef_construction=200, ef_search=400)        1.000     1143.344
Found cached result
  9:                                MariaDB(m=24, ef_construction=200, ef_search=800)        1.000      769.048
Found cached result
 10:                                MariaDB(m=24, ef_construction=200, ef_search=200)        1.000     1011.292
Found cached result
 11:                                MariaDB(m=16, ef_construction=200, ef_search=800)        1.000      938.419
Found cached result
 12:                                 MariaDB(m=24, ef_construction=200, ef_search=80)        1.000      972.378
Found cached result
 13:                                MariaDB(m=16, ef_construction=200, ef_search=200)        1.000      839.023
Found cached result
 14:                                 MariaDB(m=16, ef_construction=200, ef_search=80)        1.000      798.808
Found cached result
 15:                                 MariaDB(m=16, ef_construction=200, ef_search=10)        1.000      912.495

Ann-benchmark plot done; the last two columns in the output above are 'recall rate' and 'QPS'. ^^^


[COMPLETED]


New GitLab CI job passed.

Ignore the other failed jobs, as the development branch does not build some plugins:

[image: screenshot of the GitLab CI pipeline results]

@HugoWenTD HugoWenTD force-pushed the bb-11.4-vec-ann-benchmark branch 2 times, most recently from ac8d6d9 to 7a3507a Compare March 8, 2024 05:09
@HugoWenTD HugoWenTD force-pushed the bb-11.4-vec-ann-benchmark branch 2 times, most recently from 38fba8a to 663c971 Compare March 19, 2024 20:50
@HugoWenTD HugoWenTD changed the base branch from bb-11.4-vec to bb-11.4-vec-vicentiu March 26, 2024 22:43
@HugoWenTD HugoWenTD force-pushed the bb-11.4-vec-ann-benchmark branch 2 times, most recently from 746db91 to 205a394 Compare April 9, 2024 18:01
Introduce scripts and Dockerfile for executing the `ann-benchmarks` tool,
aimed at vector search performance testing. Support running the ANN
benchmarking both in GitLab CI and manually.

Developer Interface:

Both scripts allow their default behavior to be altered via environment
variables; refer to the detailed description in the scripts' documentation
section.

- `run-local.sh`:
  This script facilitates the execution of the ANN (Approximate Nearest
  Neighbors) benchmarking test either against local builds or a specified
  folder where the MariaDB server is installed.

- `run-docker.sh`:
  This script automates the execution of the ANN benchmarking test within
  a Docker container.
  It builds the required Docker image if it doesn't exist (or if forced),
  then builds the source code and runs the benchmark in the specified
  workspace.

GitLab CI Build:

- A new job `ann-benchmark` is included in the test stage. This job runs
  ann-benchmark against the MariaDB server built on Ubuntu 22.04.
  Initially, we are using the `random-xs-20-euclidean` dataset with 20
  dimensions and 10000 records.
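Both scripts honor environment-variable overrides. A minimal illustrative sketch of the usual shell pattern (the default values and `echo` lines here are hypothetical, not the actual script contents; only the variable names `MARIADB_ROOT_DIR` and `DATA_DIR` appear in the logs above):

```shell
# Hypothetical sketch of the env-var override pattern; not the actual
# script contents. ${VAR:-default} keeps a caller-supplied value and
# falls back to the default otherwise.
MARIADB_ROOT_DIR="${MARIADB_ROOT_DIR:-$PWD/builddir}"
DATA_DIR="${DATA_DIR:-$PWD/ann-workspace/mariadb-workspace/data}"
echo "MARIADB_ROOT_DIR: $MARIADB_ROOT_DIR"
echo "DATA_DIR: $DATA_DIR"
```

With this pattern a caller could run e.g. `MARIADB_ROOT_DIR=/opt/mariadb ./support-files/ann-benchmark/run-local.sh` to point the benchmark at a different build.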
@vuvova
Member

vuvova commented Jul 8, 2024

@HugoWenTD , could you add support for `--batch`? I tried something like

diff --git a/ann_benchmarks/algorithms/mariadb/module.py b/ann_benchmarks/algorithms/mariadb/module.py
index 382ea70..89efce1 100644
--- a/ann_benchmarks/algorithms/mariadb/module.py
+++ b/ann_benchmarks/algorithms/mariadb/module.py
@@ -8,6 +8,7 @@ import subprocess
 import sys
 import tempfile
 import time
+import threading
 
 import mariadb
 
@@ -25,7 +26,7 @@ class MariaDB(BaseANN):
         self._test_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())
         self._metric = metric
         self._m = method_param['M']
-        self._cur = None
+        self._ = threading.local()
         self._perf_proc = None
         self._perf_records = []
         self._perf_stats = []
@@ -45,7 +46,7 @@ class MariaDB(BaseANN):
 
         # Connect to MariaDB using Unix socket
         conn = mariadb.connect(unix_socket=self._socket_file)
-        self._cur = conn.cursor()
+        self._.cur = conn.cursor()
 
     def prepare_options(self):
         self._perf_stat = os.environ.get('PERF', 'no') == 'yes' and MariaDB.can_run_perf()
@@ -247,15 +248,15 @@ class MariaDB(BaseANN):
     def fit(self, X):
         # Prepare database and table
         print("\nPreparing database and table...")
-        self._cur.execute("DROP DATABASE IF EXISTS ann")
-        self._cur.execute("CREATE DATABASE ann")
-        self._cur.execute("USE ann")
-        self._cur.execute("SET mhnsw_max_edges_per_node = %d" % self._m)
-        self._cur.execute("SET rand_seed1=1, rand_seed2=2")
+        self._.cur.execute("DROP DATABASE IF EXISTS ann")
+        self._.cur.execute("CREATE DATABASE ann")
+        self._.cur.execute("USE ann")
+        self._.cur.execute("SET mhnsw_max_edges_per_node = %d" % self._m)
+        self._.cur.execute("SET rand_seed1=1, rand_seed2=2")
         # Innodb create table with index is not supported with the latest commit of the develop branch.
         # Once all supported we could use:
-        #self._cur.execute("CREATE TABLE t1 (id INT PRIMARY KEY, v BLOB NOT NULL, vector INDEX (v)) ENGINE=InnoDB;")
-        self._cur.execute("CREATE TABLE t1 (id INT PRIMARY KEY, v BLOB NOT NULL, vector INDEX (v)) ENGINE=MyISAM;")
+        #self._.cur.execute("CREATE TABLE t1 (id INT PRIMARY KEY, v BLOB NOT NULL, vector INDEX (v)) ENGINE=InnoDB;")
+        self._.cur.execute("CREATE TABLE t1 (id INT PRIMARY KEY, v BLOB NOT NULL, vector INDEX (v)) ENGINE=MyISAM;")
 
         # Insert data
         print("\nInserting data...")
@@ -263,11 +264,11 @@ class MariaDB(BaseANN):
         start_time = time.time()
         rps = 10000
         for i, embedding in enumerate(X):
-            self._cur.execute("INSERT INTO t1 (id, v) VALUES (%d, %s)", (i, bytes(vector_to_hex(embedding))))
+            self._.cur.execute("INSERT INTO t1 (id, v) VALUES (%d, %s)", (i, bytes(vector_to_hex(embedding))))
             if i % int(rps + 1) == 1:
                 rps=i/(time.time()-start_time)
                 print(f"{i:6d} of {len(X)}, {rps:4.2f} stmt/sec, ETA {(len(X)-i)/rps:.0f} sec")
-        self._cur.execute("commit")
+        self._.cur.execute("commit")
         self.perf_stop()
         print(f"\nInsert time for {X.size} records: {time.time() - start_time:7.2f}")
 
@@ -280,7 +281,7 @@ class MariaDB(BaseANN):
         elif self._metric == "euclidean":
             # The feature is being developed
             # Currently stack will be empty for indexing in perf data as nothing is executed
-            #self._cur.execute("ALTER TABLE `t1` ADD VECTOR INDEX (v);")
+            #self._.cur.execute("ALTER TABLE `t1` ADD VECTOR INDEX (v);")
             pass
         else:
             pass
@@ -292,25 +293,32 @@ class MariaDB(BaseANN):
     def set_query_arguments(self, ef_search):
         # Set ef_search
         self._ef_search = ef_search
-        self._cur.execute("SET mhnsw_limit_multiplier = %d/10" % ef_search)
+        self._.cur.execute("SET mhnsw_limit_multiplier = %d/10" % ef_search)
 
     def query(self, v, n):
-        self._cur.execute("SELECT id FROM t1 ORDER by vec_distance(v, %s) LIMIT %d", (bytes(vector_to_hex(v)), n))
-        return [id for id, in self._cur.fetchall()]
+        if not hasattr(self._, 'cur'):
+            conn = mariadb.connect(unix_socket=self._socket_file)
+            self._.cur = conn.cursor()
+            self._.cur.execute("USE ann")
+            self._.cur.execute("SET mhnsw_limit_multiplier = %d/10" % self._ef_search)
+            self._.cur.execute("SET rand_seed1=13, rand_seed2=29")
+
+        self._.cur.execute("SELECT id FROM t1 ORDER by vec_distance(v, %s) LIMIT %d", (bytes(vector_to_hex(v)), n))
+        return [id for id, in self._.cur.fetchall()]
 
     # TODO for MariaDB, get the memory usage when index is supported:
     # def get_memory_usage(self):
-    #      if self._cur is None:
+    #      if self._.cur is None:
     #         return 0
-    #      self._cur.execute("")
-    #      return self._cur.fetchone()[0] / 1024
+    #      self._.cur.execute("")
+    #      return self._.cur.fetchone()[0] / 1024
 
     def __str__(self):
         return f"MariaDB(m={self._m:2d}, ef_search={self._ef_search})"
 
     def done(self):
         # Shutdown MariaDB server when benchmarking done
-        self._cur.execute("shutdown")
+        self._.cur.execute("shutdown")
         # Stop perf for searching and do final analysis
         self.perf_stop()
         self.perf_analysis()

That works, but if you run ../build/client/mariadb-admin --socket /tmp/mysql_*.sock processlist -i1 while the benchmark is running, you'll see many connections to the server, but at most one of them is running a query at any moment. Python does create os.cpu_count() threads, but they don't run in parallel.
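The per-thread lazy cursor initialization in the diff can be sketched standalone; `StubCursor` is an illustrative stand-in for the real mariadb cursor, and `ThreadPoolExecutor` plays the role of ann-benchmarks' worker threads (all names here are hypothetical, not from the PR):

```python
# Sketch of the per-thread lazy-initialization pattern from the diff,
# with a stub in place of mariadb.connect() so it runs standalone.
import threading
from concurrent.futures import ThreadPoolExecutor

class StubCursor:
    def execute(self, sql):
        # Stand-in for the mariadb cursor's execute(); does nothing here.
        pass

class Searcher:
    def __init__(self):
        self._ = threading.local()   # one attribute namespace per thread
        self.created = []            # record which threads opened a cursor

    def query(self, v):
        # Each worker thread opens its own cursor on first use, since a
        # cursor created in the main thread is not safe to share.
        if not hasattr(self._, "cur"):
            self._.cur = StubCursor()
            self.created.append(threading.get_ident())
        self._.cur.execute("SELECT ...")
        return v

s = Searcher()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(s.query, range(100)))
```

Note that the threads still contend for the GIL, so CPU-bound work (and any blocking client call that does not release the GIL) serializes, which matches the processlist observation above.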

@vuvova
Member

vuvova commented Jul 8, 2024

I've managed to make it work with Pool (the default BaseANN.batch_query uses ThreadPool), but it looks quite awful.
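A process pool sidesteps the GIL because each worker is a separate interpreter. A minimal sketch of the shape this takes (`query_one` is a hypothetical stand-in for a single search query; a real version would open one connection per worker process via the pool's initializer):

```python
# Sketch: multiprocessing.Pool runs workers in separate processes, so they
# do not share a GIL; this is the rough shape of a Pool-based batch_query.
from multiprocessing import Pool

def query_one(i):
    # Stand-in for one search query; a real worker would reuse a
    # per-process database connection opened in an initializer.
    return i * i

if __name__ == "__main__":
    with Pool(processes=2) as p:
        out = p.map(query_one, range(10))
    print(out)  # prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The trade-off is that per-process connections and any shared state must be set up explicitly, which is likely why the Pool-based version "looks quite awful".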

@HugoWenTD
Contributor Author

, but they don't run in parallel.

@vuvova I think it might be related to the ann-benchmarks framework; I'll investigate it further once I have some time free from other tasks.

@vuvova
Member

vuvova commented Jul 9, 2024

See https://github.com/vuvova/ann-benchmarks/commits/dev/
