MDEV-33408 Alter HNSW graph storage and fix memory leak #3197

Open · wants to merge 1 commit into base: bb-11.4-vec-vicentiu
Conversation

HugoWenTD (Contributor)

Description

This commit changes the way HNSW graph information is stored in the second table. Instead of storing connections as separate records, it now stores neighbors for each node, leading to significant performance improvements and storage savings.
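
Conceptually, the change swaps one row per connection for one row per node and layer. A rough sketch of the two layouts (not the actual MariaDB row formats; the struct and field names are placeholders):

    // Illustrative only: conceptual record layouts for the HNSW second table.
    // These are not the real row formats; names are made up for the sketch.
    #include <cstdint>
    #include <string>
    #include <vector>

    // Old layout: one row per connection (layer, src, dst).
    struct ConnectionRow
    {
      uint8_t layer;
      std::string src_ref;   // handler reference of the source node
      std::string dst_ref;   // handler reference of the destination node
    };

    // New layout: one row per node and layer, with every neighbour reference
    // packed into a single varbinary-style column (layer, src, neighbors[]).
    struct NeighborListRow
    {
      uint8_t layer;
      std::string src_ref;                 // handler reference of the node
      std::vector<std::string> neighbors;  // packed neighbour references
    };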

Compared with the previous approach, insert speed is 5 times faster, search speed improves by 23%, and storage usage is reduced by 73%, based on ann-benchmark tests with the random-xs-20-euclidean and random-s-100-euclidean datasets.

Additionally, in the previous code, vector objects were not released after use, resulting in excessive memory consumption (over 20 GB to build the index with 90,000 records), which prevented tests with large datasets.
This commit ensures that vectors are released appropriately in the insert and search functions. Note that some vectors still need to be cleaned up after search-query completion; this will be addressed in a future commit.
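
As an aside, this is the general shape of the scoped-ownership pattern that avoids such leaks (FVector here is a placeholder type, not the class used in this PR):

    // Illustrative only: one generic way to guarantee temporary vectors are
    // released. FVector is a placeholder, not the actual class in this PR.
    #include <memory>
    #include <vector>

    struct FVector { std::vector<float> data; };

    void process_candidates()
    {
      // Owning candidates through unique_ptr means they are freed when the
      // container goes out of scope (including on early returns), instead of
      // accumulating for the whole index build or query.
      std::vector<std::unique_ptr<FVector>> candidates;
      candidates.push_back(std::make_unique<FVector>());
      // ... use candidates during insert/search ...
    }   // all FVector objects released here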

How can this PR be tested?

Comparing performance against the previous method of saving connections, using the ann-benchmark tool, shows significant improvements in insert speed, search speed, and storage size:

To isolate the impact of the storage structure on performance, I kept the HNSW parameters consistent in this commit. The only different parameter is EF_SEARCH=10, which helps improve recall.
The "new" version is based on the commit in this pull request, and the "old" version is commit 00b613a2, which includes my fix from a previous pending PR.

  • With the random-xs-20-euclidean dataset:

    Metric                            old         new        change
    Insert time for 9000 records      109.41 sec  22.17 sec  5 times faster
    K-ANN search recall               0.737       0.782      +6%
    K-ANN search QPS                  687.526     843.368    +22%
    Second table size (data + index)  8.1M        2.16M      -73%
    • New second table data and index sizes t1#i#01.*:
      -rw-rw---- 1 wenhug amazon 1.9M Apr 12 20:59 t1#i#01.MYD
      -rw-rw---- 1 wenhug amazon 272K Apr 12 20:59 t1#i#01.MYI
      -rw-rw---- 1 wenhug amazon 1.5K Apr 12 20:58 t1.frm
      -rw-rw---- 1 wenhug amazon 809K Apr 12 20:59 t1.MYD
      -rw-rw---- 1 wenhug amazon  93K Apr 12 20:59 t1.MYI
      
    • Old second table data and index sizes t1#i#01.*:
      -rw-rw---- 1 wenhug amazon 5.0M Apr 12 17:16 t1#i#01.MYD
      -rw-rw---- 1 wenhug amazon 3.1M Apr 12 17:17 t1#i#01.MYI
      -rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:15 t1.frm
      -rw-rw---- 1 wenhug amazon 809K Apr 12 17:16 t1.MYD
      -rw-rw---- 1 wenhug amazon  93K Apr 12 17:17 t1.MYI
      
  • With the random-s-100-euclidean dataset:

    Metric                            old          new         change
    Insert time for 90000 records     1328.48 sec  259.15 sec  5 times faster
    K-ANN search recall               0.221        0.335       +50%
    K-ANN search QPS                  539.019      668.827     +24%
    Second table size (data + index)  79M          20.8M       -73%
    • New second table data and index sizes t1#i#01.*:

      -rw-rw---- 1 wenhug amazon  18M Apr 12 17:30 t1#i#01.MYD
      -rw-rw---- 1 wenhug amazon 2.8M Apr 12 17:33 t1#i#01.MYI
      -rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:26 t1.frm
      -rw-rw---- 1 wenhug amazon  36M Apr 12 17:30 t1.MYD
      -rw-rw---- 1 wenhug amazon 905K Apr 12 17:33 t1.MYI
      
    • Old second table data and index sizes t1#i#01.*:

      -rw-rw---- 1 wenhug amazon  46M Apr 12 17:56 t1#i#01.MYD
      -rw-rw---- 1 wenhug amazon  33M Apr 12 17:59 t1#i#01.MYI
      -rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:34 t1.frm
      -rw-rw---- 1 wenhug amazon  36M Apr 12 17:56 t1.MYD
      -rw-rw---- 1 wenhug amazon 905K Apr 12 17:59 t1.MYI
      


@HugoWenTD HugoWenTD marked this pull request as ready for review April 12, 2024 23:33

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
@LinuxJedi (Contributor)

Assigned to @cvicentiu as it is against his branch.

@cvicentiu (Member)

Hi @HugoWenTD!

Thank you for the contribution.

I will address your work in order of appearance:

  1. The benchmark tool is immensely helpful. I will rebase it at the beginning of bb-11.4-vec-vicentiu for ease of use for all of us. We will have to clean it up a bit, and perhaps it should not reside in the MariaDB Server repo directly, but that's a problem for another time, not relevant during this development effort.

In any case, I have used it to double-check your findings and I can reproduce everything (except the results for this particular PR, but more on this later).

  1. The "bugfix" you suggested on only deleting one-directional connections: (PR: MDEV-33408 Update insert algorithm for HNSW index #3156)
    I've re-read the paper a few times and I strongly believe that the graph is only meant to be considered an undirected graph, hence all connections are bi-directional. If you can point out a paragraph to me (or any other source) where the original HNSW allows one-directional links, I am more than happy to reconsider.

My understanding (at this point in time) is that "removing only a subset of links", as you've done in #3156, causes a soft increase in MAX_NEIGHBORS_PER_LAYER. The same effect can be achieved by just bumping the MAX_NEIGHBORS_PER_LAYER variable.

So I am inclined to reject that bugfix and instead focus on tuning the graph construction parameters.

  3. Here is a set of results I got using the following constants, from commit 2f6b2b3:
  const uint EF_CONSTRUCTION = 32; // max candidate list size to connect to.
  const uint MAX_INSERT_NEIGHBOUR_CONNECTIONS = 100;
  const uint MAX_NEIGHBORS_PER_LAYER = 100;
  const double NORMALIZATION_FACTOR = 1.2;

without any of your changes in PR #3156

  0:    MariaDB(m=16, ef_construction=200, ef_search=10)        0.906     1035.211

So we can achieve 90% recall by having more neighbors per layer and bumping ef_construction. Of course, the QPS drops, as the graph is bigger, but as you've noticed, the code was initially written for correctness, not for speed. Once we've proven correctness, we can look into speed improvements.

  4. I tried your code from this PR (437e214). I get the following results, which I believe are due to a bug somewhere:
  0:    MariaDB(m=16, ef_construction=200, ef_search=10)        0.005     5095.530

Can you double check your findings and see if you've submitted the correct code to this PR?

In general I agree with your approach of using an adjacency list (layer, src, neighbors[]) instead of tuples (layer, src, dst).

This approach has some benefits for sure:

  1. The number of rows in the table is drastically reduced. (reducing index size)

  2. There is less data duplication in the table. (again reducing index size)

  3. We are more memory friendly and can thus run searches faster. (I believe we should optimize for SELECT, not for INSERT/UPDATE/DELETE, but of course we can not completely slow down INSERTS)

  4. Because of less data to be written to disk, INSERTS should be faster (as your benchmark shows)

  5. The adjacency list approach removes a lot of ha_index_next calls because it loads all neighbours into memory in one lookup. That's great!

The downsides:

  6. Updating a connection will cause a much larger write in the storage engine. We need to benchmark with longer neighbour lists (100 neighbors max for instance).

  7. Delete may be trickier to implement. I had started working on delete with my version of the code (layer, src, dst), but I think that your solution has more promise. The way I'm implementing delete is by using a "deleted" bit for a node. If the node is deleted, we skip it during all iterations / searches (see the sketch after this list). I can probably adapt the solution to work with your code. I'll try this right now and share the result as soon as it's working.

  8. Your benchmark shows that we need to be able to parameterize the index construction at runtime so we can experiment without recompiling. Would you be interested in tackling that bit? You can create some system variables for now and we'll use those. A quick and dirty solution, just to map with your already written benchmark code. I noticed you have some commented out SET commands that you wanted to use:

ann_benchmarks/algorithms/mariadb/module.py
288:        #self._cur.execute("SET hnsw.ef_search = %d" % ef_search)

That's all the thoughts I have for now. Let me know when you've double-checked the benchmark results for your code; in case I messed up when running it, I'll fix it.
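
A minimal sketch of the "deleted" bit idea from point 7, on placeholder types rather than the server's actual structures:

    // Illustrative only: tombstone-style delete. The node keeps its row and
    // its links, but is never returned as a result. Whether deleted nodes
    // should still be traversed through (to keep the graph connected) is a
    // separate design choice.
    #include <cstddef>
    #include <queue>
    #include <unordered_set>
    #include <vector>

    struct GraphNode
    {
      bool deleted = false;               // "deleted" bit: skipped, not erased
      std::vector<GraphNode*> neighbors;  // adjacency list for one layer
    };

    std::vector<GraphNode*> collect_results(GraphNode *entry, std::size_t limit)
    {
      std::vector<GraphNode*> results;
      std::unordered_set<GraphNode*> visited;
      std::queue<GraphNode*> frontier;
      frontier.push(entry);
      visited.insert(entry);
      while (!frontier.empty() && results.size() < limit)
      {
        GraphNode *node = frontier.front();
        frontier.pop();
        if (!node->deleted)               // skip tombstoned nodes as results
          results.push_back(node);
        for (GraphNode *n : node->neighbors)
          if (visited.insert(n).second)   // visit each node at most once
            frontier.push(n);
      }
      return results;
    }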

@cvicentiu (Member)

@HugoWenTD I've done a little bit more experimenting with your bugfix from #3156. For certain combinations of parameters, it does seem to offer a marginal improvement in both QPS and recall, but I'd like to test this more systematically.

@HugoWenTD (Contributor, Author)

HugoWenTD commented Apr 17, 2024

Hi @cvicentiu, thank you for reviewing the pull requests and providing feedback on the benchmarking tool. I really appreciate your input.

I'm traveling for meetings this week, but I will try to address the items you mentioned later this week or next week. You made an important point that mMax (max neighbours) affects the update speed when using node->neighbours[]. I believe I did run tests with some bigger mMax values, but for the comparison with the previous implementation I only recorded the results for the default parameters. I will gather more data on the impact of mMax and get back to you.


Regarding your comment:

I've re-read the paper a few times and I strongly believe that the graph is only meant to be considered an undirected graph, hence all connections are bi-directional. If you can point out a paragraph to me (or any other source) where the original HNSW allows one-directional links, I am more than happy to reconsider.

Here are a few points supporting the use of one-directional links:

  1. Algorithm 1 in the paper describes the shrink steps (a rough sketch of my reading follows after this list):
    [figure: Algorithm 1 (INSERT) from the HNSW paper, showing the neighbour shrink step]

At line 16, "set neighbourhood(e) at layer lc to eNewConn" shrinks only the one-directional connection; it does not mention deleting the connection in the opposite direction.

  2. In addition, PgVector's shrinking step is in this section.
    It also shrinks only one direction.

  3. In the worst case, there could be a scenario where vectors get isolated:

    • Assume mMax-level0 is 4
    • When vector 5 (V5) is inserted, V4 gets isolated.
    • If V4 happens to be the entry point of a higher level, none of V1/V2/V3/V5 will ever be traversed in future inserts or searches.
    • While increasing mMax could mitigate this, it still reduces the navigability of the graph; in the benchmark tests, I did see lower recall caused by fewer vectors being returned than the specified limit.

[figure: example where V4 becomes isolated after V5 is inserted]
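
For reference, here is a rough self-contained sketch of my reading of that shrink step. Node and select_neighbors are placeholders (the paper's SELECT-NEIGHBORS picks the m_max closest candidates by distance; the stand-in below just truncates):

    // Sketch of the shrink step in Algorithm 1 (INSERT), lines ~13-17.
    #include <cstddef>
    #include <vector>

    struct Node { std::vector<Node*> neighbors; };

    // Stand-in for SELECT-NEIGHBORS: keep at most m_max candidates.
    static std::vector<Node*> select_neighbors(std::vector<Node*> candidates,
                                               std::size_t m_max)
    {
      if (candidates.size() > m_max)
        candidates.resize(m_max);
      return candidates;
    }

    static void add_bidirectional_link(Node *q, Node *e, std::size_t m_max)
    {
      q->neighbors.push_back(e);
      e->neighbors.push_back(q);

      // Shrink step: if e now has too many connections, only e's own
      // neighbour list is recomputed (line 16). Nothing in the algorithm
      // removes q's link back to the neighbours that e drops, which is why
      // the dropped links can end up one-directional.
      if (e->neighbors.size() > m_max)
        e->neighbors = select_neighbors(e->neighbors, m_max);
    }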

{
return store_link(graph, layer_number, source_node, source_node);
graph->field[2]->store_binary(
reinterpret_cast<const char *>(neighbor_array), totalSize);
cvicentiu (Member):

This is not correct. One cannot just store "structs" on disk. Between compilers / architectures, structs might be padded differently. A struct is an in-memory representation that is not guaranteed to be stable on disk. For storing things persistently, one must:

  1. Store things byte-by-byte, explicitly.
  2. Ensure that the storage happens using a fixed endianness. There are uintXkorr functions that can be used.

Note:
This is likely a problem with my first implementation too, when reading floats from disk. We cannot assume floats are stored in little-endian format. If we run on a big-endian machine, the data would not be portable.
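
As an illustration, a minimal self-contained sketch of storing the neighbour list byte-by-byte with an explicit little-endian layout; in the server, the intNstore/uintNkorr-style macros provide equivalent fixed-endian access, and the layout chosen here is only an example:

    // Illustrative only: explicit, fixed-endian serialization instead of
    // memcpy-ing a struct. Layout: [uint16 count][count * ref_length bytes].
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    static void store_uint16_le(std::string &out, uint16_t v)
    {
      out.push_back(static_cast<char>(v & 0xFF));         // low byte first
      out.push_back(static_cast<char>((v >> 8) & 0xFF));  // then high byte
    }

    static uint16_t read_uint16_le(const unsigned char *p)
    {
      return static_cast<uint16_t>(p[0] | (p[1] << 8));
    }

    std::string pack_neighbors(const std::vector<std::string> &refs)
    {
      std::string out;
      store_uint16_le(out, static_cast<uint16_t>(refs.size()));
      for (const std::string &ref : refs)
        out.append(ref);                 // references are opaque byte strings
      return out;
    }

    std::vector<std::string> unpack_neighbors(const unsigned char *buf,
                                              std::size_t ref_length)
    {
      uint16_t count = read_uint16_le(buf);
      std::vector<std::string> refs;
      const unsigned char *p = buf + 2;
      for (uint16_t i = 0; i < count; i++, p += ref_length)
        refs.emplace_back(reinterpret_cast<const char*>(p), ref_length);
      return refs;
    }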

HugoWenTD (Contributor, Author):

This is definitely a great point, Vicentiu! I did understand that this is not a portable solution for the final implementation. In this commit I mainly wanted to make a workable version to see if the solution makes sense. I should have added an inline comment for this TODO. Also, one aspect I wasn't very familiar with is the reference size. As you may have noticed in the code, I assumed all references of the same table have the same size (in my test the ref_size seems to be 6 bytes). I can look deeper into understanding how the references operate and maybe find a more efficient structure for storing the neighbours.

Currently in the code I'm using this structure to save the neigh_ref:

struct NeighborArray
{
  uint8 num;
  char neigh_ref[];
};

Do you think we need the uint8 num here to validate the number of neighbours read from the second table? Or should we simply derive the neighbour list size from the length of neigh_ref divided by the ref_length obtained from the source table info?
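
One possible shape of the length-derived option (a sketch, assuming the packed column contains nothing but fixed-size references):

    // Illustrative only: derive the neighbour count from the stored length,
    // with a sanity check instead of a separate uint8 counter.
    #include <cassert>
    #include <cstddef>

    std::size_t neighbor_count(std::size_t stored_length, std::size_t ref_length)
    {
      // A truncated or corrupted value would fail this check.
      assert(ref_length != 0 && stored_length % ref_length == 0);
      return stored_length / ref_length;
    }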

}

for (const auto &node: dest_nodes)
// record need update
if (cmp_record(graph, record[1]))
cvicentiu (Member):

This compares the full record, even the "uninitialized bytes" of the varbinary column. The low recall I was experiencing with your branch was fixed by removing this line.

HugoWenTD (Contributor, Author):

Nice catch! I initially had the following two lines immediately after reading the record into record[1] with ha_index_read_map. Later I tried to optimize it to avoid storing field[2] in some scenarios, but apparently the logic of storing after the comparison is incorrect.

    graph->field[2]->store_binary(
      reinterpret_cast<const char *>(neighbor_array), totalSize);

HugoWenTD (Contributor, Author):

Another thing regarding the recall: I tried various parameter values last week. I tested your original implementation, your implementation with my changes in this PR, and the implementation in the current PR. However, I always got a lower recall rate compared to PGVector using the same parameter sets. There could be something we overlooked or something PGVector does differently.

@HugoWenTD (Contributor, Author)

Your benchmark shows that we need to be able to parameterize the index construction at runtime so we can experiment without recompiling. Would you be interested in tackling that bit? You can create some system variables for now and we'll use those. A quick and dirty solution, just to map with your already written benchmark code. I noticed you have some commented out SET commands that you wanted to use:

Hi @cvicentiu, I just created this PR #3226 targeting your newly created branch bb-11.4-vec-vicentiu-hugo, for configuring the parameters through system variables. Please be aware that the modification in ann-benchmark is not backward compatible, so I have created a new branch to accommodate this change.
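
For reference, a minimal plugin-style sketch of what such a tunable could look like; the variable name, bounds, and declaration style are assumptions, and the actual variables in #3226 may well be defined differently (for example as server variables):

    // Illustrative only: plugin-style system variable declaration. The name
    // and limits are placeholders, not the definitions used in PR #3226.
    #include <mysql/plugin.h>

    static unsigned int hnsw_ef_search;

    static MYSQL_SYSVAR_UINT(ef_search, hnsw_ef_search,
                             PLUGIN_VAR_RQCMDARG,
                             "Candidate list size used during HNSW search",
                             NULL, NULL,
                             10,     /* default */
                             1,      /* min */
                             10000,  /* max */
                             0);     /* block size */

    static struct st_mysql_sys_var *hnsw_sys_vars[]=
    {
      MYSQL_SYSVAR(ef_search),
      NULL
    };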

@HugoWenTD (Contributor, Author)

Updating a connection will cause a much larger write in the storage engine. We need to benchmark with longer neighbour lists (100 neighbors max for instance).

Sharing some benchmark results with different m values using DATASET=random-s-100-euclidean. The insert time and search time increased as expected, but are still acceptable, I think. I will test some more combinations with different ef_construction values.

0: MariaDB(m=100, ef_construction=10, ef_search=10)
   recall: 0.846
   QPS: 669.383
   Insert time: 252.37262177467346

1: MariaDB(m=50, ef_construction=10, ef_search=10)
   recall: 0.825
   QPS: 678.479
   Insert time: 258.6093146800995

2: MariaDB(m=10, ef_construction=10, ef_search=10)
   recall: 0.335
   QPS: 1052.397
   Insert time: 192.79994201660156
