MDEV-33408 Alter HNSW graph storage and fix memory leak #3197

Open · wants to merge 1 commit into base: bb-11.4-vec-vicentiu
Conversation

HugoWenTD (Contributor)

Description

This commit changes the way HNSW graph information is stored in the second table. Instead of storing connections as separate records, it now stores neighbors for each node, leading to significant performance improvements and storage savings.
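
Conceptually, the change swaps one row per connection for one row per node and layer. A rough sketch of the two layouts (not the actual MariaDB row formats; the struct and field names are placeholders):

    // Illustrative only: conceptual record layouts for the HNSW second table.
    // These are not the real row formats; names are made up for the sketch.
    #include <cstdint>
    #include <string>
    #include <vector>

    // Old layout: one row per connection (layer, src, dst).
    struct ConnectionRow
    {
      uint8_t layer;
      std::string src_ref;   // handler reference of the source node
      std::string dst_ref;   // handler reference of the destination node
    };

    // New layout: one row per node and layer, with every neighbour reference
    // packed into a single varbinary-style column (layer, src, neighbors[]).
    struct NeighborListRow
    {
      uint8_t layer;
      std::string src_ref;                 // handler reference of the node
      std::vector<std::string> neighbors;  // packed neighbour references
    };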

Compared with the previous approach, insert speed is 5 times faster, search speed improves by 23%, and storage usage is reduced by 73%, based on ann-benchmark tests with the random-xs-20-euclidean and random-s-100-euclidean datasets.

Additionally, in the previous code, vector objects were not released after use, resulting in excessive memory consumption (over 20 GB to build the index with 90,000 records), which prevented tests with large datasets.
This commit ensures that vectors are released appropriately in the insert and search functions. Note that some vectors still need to be cleaned up after search-query completion; this will be addressed in a future commit.
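
As an aside, this is the general shape of the scoped-ownership pattern that avoids such leaks (FVector here is a placeholder type, not the class used in this PR):

    // Illustrative only: one generic way to guarantee temporary vectors are
    // released. FVector is a placeholder, not the actual class in this PR.
    #include <memory>
    #include <vector>

    struct FVector { std::vector<float> data; };

    void process_candidates()
    {
      // Owning candidates through unique_ptr means they are freed when the
      // container goes out of scope (including on early returns), instead of
      // accumulating for the whole index build or query.
      std::vector<std::unique_ptr<FVector>> candidates;
      candidates.push_back(std::make_unique<FVector>());
      // ... use candidates during insert/search ...
    }   // all FVector objects released here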

How can this PR be tested?

Comparing performance against the previous method of saving connections, using the ann-benchmark tool, shows significant improvements in insert speed, search speed, and storage size:

To isolate the impact of the storage structure on performance, I kept the HNSW parameters consistent in this commit. The only different parameter is EF_SEARCH=10, which helps improve recall.
The "new" version is based on the commit in this pull request, and the "old" version is commit 00b613a2, which includes my fix from a previous pending PR.

  • With the random-xs-20-euclidean dataset:

    Metric                            old         new        change
    Insert time for 9000 records      109.41 sec  22.17 sec  5 times faster
    K-ANN search recall               0.737       0.782      +6%
    K-ANN search QPS                  687.526     843.368    +22%
    Second table size (data + index)  8.1M        2.16M      -73%
    • New second table data and index sizes t1#i#01.*:
      -rw-rw---- 1 wenhug amazon 1.9M Apr 12 20:59 t1#i#01.MYD
      -rw-rw---- 1 wenhug amazon 272K Apr 12 20:59 t1#i#01.MYI
      -rw-rw---- 1 wenhug amazon 1.5K Apr 12 20:58 t1.frm
      -rw-rw---- 1 wenhug amazon 809K Apr 12 20:59 t1.MYD
      -rw-rw---- 1 wenhug amazon  93K Apr 12 20:59 t1.MYI
      
    • Old second table data and index sizes t1#i#01.*:
      -rw-rw---- 1 wenhug amazon 5.0M Apr 12 17:16 t1#i#01.MYD
      -rw-rw---- 1 wenhug amazon 3.1M Apr 12 17:17 t1#i#01.MYI
      -rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:15 t1.frm
      -rw-rw---- 1 wenhug amazon 809K Apr 12 17:16 t1.MYD
      -rw-rw---- 1 wenhug amazon  93K Apr 12 17:17 t1.MYI
      
  • With the random-s-100-euclidean dataset:

    Metric                            old          new         change
    Insert time for 90000 records     1328.48 sec  259.15 sec  5 times faster
    K-ANN search recall               0.221        0.335       +50%
    K-ANN search QPS                  539.019      668.827     +24%
    Second table size (data + index)  79M          20.8M       -73%
    • New second table data and index sizes t1#i#01.*:

      -rw-rw---- 1 wenhug amazon  18M Apr 12 17:30 t1#i#01.MYD
      -rw-rw---- 1 wenhug amazon 2.8M Apr 12 17:33 t1#i#01.MYI
      -rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:26 t1.frm
      -rw-rw---- 1 wenhug amazon  36M Apr 12 17:30 t1.MYD
      -rw-rw---- 1 wenhug amazon 905K Apr 12 17:33 t1.MYI
      
    • Old second table data and index sizes t1#i#01.*:

      -rw-rw---- 1 wenhug amazon  46M Apr 12 17:56 t1#i#01.MYD
      -rw-rw---- 1 wenhug amazon  33M Apr 12 17:59 t1#i#01.MYI
      -rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:34 t1.frm
      -rw-rw---- 1 wenhug amazon  36M Apr 12 17:56 t1.MYD
      -rw-rw---- 1 wenhug amazon 905K Apr 12 17:59 t1.MYI
      


@HugoWenTD HugoWenTD marked this pull request as ready for review April 12, 2024 23:33

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
@LinuxJedi (Contributor)

Assigned to @cvicentiu as it is against his branch.

@cvicentiu (Member)

Hi @HugoWenTD!

Thank you for the contribution.

I will address your work in order of appearance:

  1. The benchmark tool is immensely helpful. I will rebase it at the beginning of bb-11.4-vec-vicentiu for ease of use for all of us. We will have to clean it up a bit, and perhaps it should not reside in the MariaDB Server repo directly, but that's a problem for another time, not relevant during this development effort.

In any case, I have used it to double-check your findings and I can reproduce everything (except the results for this particular PR, but more on this later).

  1. The "bugfix" you suggested on only deleting one-directional connections: (PR: MDEV-33408 Update insert algorithm for HNSW index #3156)
    I've re-read the paper a few times and I strongly believe that the graph is only meant to be considered an undirected graph, hence all connections are bi-directional. If you can point out a paragraph to me (or any other source) where the original HNSW allows one-directional links, I am more than happy to reconsider.

My understanding (at this point in time) is that "removing only a subset of links", as you've done in #3156, causes a soft increase in MAX_NEIGHBORS_PER_LAYER. The same effect can be achieved by just bumping the MAX_NEIGHBORS_PER_LAYER variable.

So I am inclined to reject that bugfix and instead focus on tuning the graph construction parameters.

  3. Here is a set of results I got using the following constants, from commit 2f6b2b3:
  const uint EF_CONSTRUCTION = 32; // max candidate list size to connect to.
  const uint MAX_INSERT_NEIGHBOUR_CONNECTIONS = 100;
  const uint MAX_NEIGHBORS_PER_LAYER = 100;
  const double NORMALIZATION_FACTOR = 1.2;

without any of your changes in PR #3156

  0:    MariaDB(m=16, ef_construction=200, ef_search=10)        0.906     1035.211

So we can achieve 90% recall by having more neighbors per layer and bumping ef_construction. Of course, the QPS drops, as the graph is bigger, but as you've noticed, the code was initially written for correctness, not for speed. Once we've proven correctness, we can look into speed improvements.

  4. I tried your code from this PR (437e214). I get the following results, which I believe are due to a bug somewhere:
  0:    MariaDB(m=16, ef_construction=200, ef_search=10)        0.005     5095.530

Can you double check your findings and see if you've submitted the correct code to this PR?

In general I agree with your approach of using an adjacency list (layer, src, neighbors[]) instead of tuples (layer, src, dst).

This approach has some benefits for sure:

  1. The number of rows in the table is drastically reduced. (reducing index size)

  2. There is less data duplication in the table. (again reducing index size)

  3. We are more memory friendly and can thus run searches faster. (I believe we should optimize for SELECT, not for INSERT/UPDATE/DELETE, but of course we can not completely slow down INSERTS)

  4. Because of less data to be written to disk, INSERTS should be faster (as your benchmark shows)

  5. The adjacency list approach removes a lot of ha_index_next calls because it loads all neighbours into memory in one lookup. That's great!

The downsides:

  6. Updating a connection will cause a much larger write in the storage engine. We need to benchmark with longer neighbour lists (100 neighbors max for instance).

  7. Delete may be trickier to implement. I had started working on delete with my version of the code (layer, src, dst), but I think that your solution has more promise. The way I'm implementing delete is by using a "deleted" bit for a node. If the node is deleted, we skip it during all iterations / searches (see the sketch after this list). I can probably adapt the solution to work with your code. I'll try this right now and share the result as soon as it's working.

  8. Your benchmark shows that we need to be able to parameterize the index construction at runtime so we can experiment without recompiling. Would you be interested in tackling that bit? You can create some system variables for now and we'll use those. A quick and dirty solution, just to map with your already written benchmark code. I noticed you have some commented out SET commands that you wanted to use:

ann_benchmarks/algorithms/mariadb/module.py
288:        #self._cur.execute("SET hnsw.ef_search = %d" % ef_search)

That's all the thoughts I have for now. Let me know when you've double-checked the benchmark results for your code; in case I messed up when running it, I'll fix it.
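
A minimal sketch of the "deleted" bit idea from point 7, on placeholder types rather than the server's actual structures:

    // Illustrative only: tombstone-style delete. The node keeps its row and
    // its links, but is never returned as a result. Whether deleted nodes
    // should still be traversed through (to keep the graph connected) is a
    // separate design choice.
    #include <cstddef>
    #include <queue>
    #include <unordered_set>
    #include <vector>

    struct GraphNode
    {
      bool deleted = false;               // "deleted" bit: skipped, not erased
      std::vector<GraphNode*> neighbors;  // adjacency list for one layer
    };

    std::vector<GraphNode*> collect_results(GraphNode *entry, std::size_t limit)
    {
      std::vector<GraphNode*> results;
      std::unordered_set<GraphNode*> visited;
      std::queue<GraphNode*> frontier;
      frontier.push(entry);
      visited.insert(entry);
      while (!frontier.empty() && results.size() < limit)
      {
        GraphNode *node = frontier.front();
        frontier.pop();
        if (!node->deleted)               // skip tombstoned nodes as results
          results.push_back(node);
        for (GraphNode *n : node->neighbors)
          if (visited.insert(n).second)   // visit each node at most once
            frontier.push(n);
      }
      return results;
    }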

@cvicentiu (Member)

@HugoWenTD I've done a little bit more experimenting with your bugfix from #3156. For certain combinations of parameters, it does seem to offer a marginal improvement in both QPS and recall, but I'd like to test this more systematically.

@HugoWenTD (Contributor, Author)

HugoWenTD commented Apr 17, 2024

Hi @cvicentiu, thank you for reviewing the pull requests and providing feedback on the benchmarking tool. I really appreciate your input.

I'm traveling for meetings this week, but I will try to address the items you mentioned later this week or next week. You made an important point that mMax (max neighbours) affects the update speed when using node->neighbours[]. I believe I did run tests with some bigger mMax values, but for the comparison with the previous implementation I only recorded the results for the default parameters. I will gather more data on the impact of mMax and get back to you.


Regarding your comment:

I've re-read the paper a few times and I strongly believe that the graph is only meant to be considered an undirected graph, hence all connections are bi-directional. If you can point out a paragraph to me (or any other source) where the original HNSW allows one-directional links, I am more than happy to reconsider.

Here are a few points supporting the use of one-directional links:

  1. Algorithm 1 in the paper describes the shrink steps (a rough sketch of my reading follows after this list):
    [figure: Algorithm 1 (INSERT) from the HNSW paper, showing the neighbour shrink step]

At line 16, "set neighbourhood(e) at layer lc to eNewConn" shrinks only the one-directional connection; it does not mention deleting the connection in the opposite direction.

  2. In addition, PgVector's shrinking step is in this section.
    It also shrinks only one direction.

  3. In the worst case, there could be a scenario where vectors get isolated:

    • Assume mMax-level0 is 4
    • When vector 5 (V5) is inserted, V4 gets isolated.
    • If V4 happens to be the entry point of a higher level, none of V1/V2/V3/V5 will ever be traversed in future inserts or searches.
    • While increasing mMax could mitigate this, it still reduces the navigability of the graph; in the benchmark tests, I did see lower recall caused by fewer vectors being returned than the specified limit.

[figure: example where V4 becomes isolated after V5 is inserted]
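
For reference, here is a rough self-contained sketch of my reading of that shrink step. Node and select_neighbors are placeholders (the paper's SELECT-NEIGHBORS picks the m_max closest candidates by distance; the stand-in below just truncates):

    // Sketch of the shrink step in Algorithm 1 (INSERT), lines ~13-17.
    #include <cstddef>
    #include <vector>

    struct Node { std::vector<Node*> neighbors; };

    // Stand-in for SELECT-NEIGHBORS: keep at most m_max candidates.
    static std::vector<Node*> select_neighbors(std::vector<Node*> candidates,
                                               std::size_t m_max)
    {
      if (candidates.size() > m_max)
        candidates.resize(m_max);
      return candidates;
    }

    static void add_bidirectional_link(Node *q, Node *e, std::size_t m_max)
    {
      q->neighbors.push_back(e);
      e->neighbors.push_back(q);

      // Shrink step: if e now has too many connections, only e's own
      // neighbour list is recomputed (line 16). Nothing in the algorithm
      // removes q's link back to the neighbours that e drops, which is why
      // the dropped links can end up one-directional.
      if (e->neighbors.size() > m_max)
        e->neighbors = select_neighbors(e->neighbors, m_max);
    }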

{
return store_link(graph, layer_number, source_node, source_node);
graph->field[2]->store_binary(
reinterpret_cast<const char *>(neighbor_array), totalSize);
cvicentiu (Member):

This is not correct. One cannot just store "structs" on disk. Between compilers / architectures, structs might be padded differently. A struct is an in-memory representation that is not guaranteed to be stable on disk. For storing things persistently, one must:

  1. Store things byte-by-byte, explicitly.
  2. Ensure that the storage happens using a fixed endianness. There are uintXkorr functions that can be used.

Note:
This is likely a problem with my first implementation too, when reading floats from disk. We cannot assume floats are stored in little-endian format. If we run on a big-endian machine, the data would not be portable.
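
As an illustration, a minimal self-contained sketch of storing the neighbour list byte-by-byte with an explicit little-endian layout; in the server, the intNstore/uintNkorr-style macros provide equivalent fixed-endian access, and the layout chosen here is only an example:

    // Illustrative only: explicit, fixed-endian serialization instead of
    // memcpy-ing a struct. Layout: [uint16 count][count * ref_length bytes].
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    static void store_uint16_le(std::string &out, uint16_t v)
    {
      out.push_back(static_cast<char>(v & 0xFF));         // low byte first
      out.push_back(static_cast<char>((v >> 8) & 0xFF));  // then high byte
    }

    static uint16_t read_uint16_le(const unsigned char *p)
    {
      return static_cast<uint16_t>(p[0] | (p[1] << 8));
    }

    std::string pack_neighbors(const std::vector<std::string> &refs)
    {
      std::string out;
      store_uint16_le(out, static_cast<uint16_t>(refs.size()));
      for (const std::string &ref : refs)
        out.append(ref);                 // references are opaque byte strings
      return out;
    }

    std::vector<std::string> unpack_neighbors(const unsigned char *buf,
                                              std::size_t ref_length)
    {
      uint16_t count = read_uint16_le(buf);
      std::vector<std::string> refs;
      const unsigned char *p = buf + 2;
      for (uint16_t i = 0; i < count; i++, p += ref_length)
        refs.emplace_back(reinterpret_cast<const char*>(p), ref_length);
      return refs;
    }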

HugoWenTD (Contributor, Author):

This is definitely a great point, Vicentiu! I did understand that this is not a portable solution for the final implementation. In this commit I mainly wanted to make a workable version to see if the solution makes sense. I should have added an inline comment for this TODO. Also, one aspect I wasn't very familiar with is the reference size. As you may have noticed in the code, I assumed all references of the same table have the same size (in my test the ref_size seems to be 6 bytes). I can look deeper into understanding how the references operate and maybe find a more efficient structure for storing the neighbours.

Currently in the code I'm using this structure to save the neigh_ref:

struct NeighborArray
{
  uint8 num;
  char neigh_ref[];
};

Do you think we need the uint8 num here to validate the number of neighbours read from the second table? Or should we simply derive the neighbour list size from the length of neigh_ref divided by the ref_length obtained from the source table info?
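
One possible shape of the length-derived option (a sketch, assuming the packed column contains nothing but fixed-size references):

    // Illustrative only: derive the neighbour count from the stored length,
    // with a sanity check instead of a separate uint8 counter.
    #include <cassert>
    #include <cstddef>

    std::size_t neighbor_count(std::size_t stored_length, std::size_t ref_length)
    {
      // A truncated or corrupted value would fail this check.
      assert(ref_length != 0 && stored_length % ref_length == 0);
      return stored_length / ref_length;
    }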

}

for (const auto &node: dest_nodes)
// record need update
if (cmp_record(graph, record[1]))
cvicentiu (Member):

This compares the full record, even the "uninitialized bytes" of the varbinary column. The low recall I was experiencing with your branch was fixed by removing this line.

HugoWenTD (Contributor, Author):

Nice catch! I initially had the following two lines immediately after reading the record into record[1] with ha_index_read_map. Later I tried to optimize it to avoid storing field[2] in some scenarios, but apparently the logic of storing after the comparison is incorrect.

    graph->field[2]->store_binary(
      reinterpret_cast<const char *>(neighbor_array), totalSize);

HugoWenTD (Contributor, Author):

Another thing regarding the recall: I tried various parameter values last week. I tested your original implementation, your implementation with my changes in this PR, and the implementation in the current PR. However, I always got a lower recall rate compared to PGVector using the same parameter sets. There could be something we overlooked or something PGVector does differently.

@HugoWenTD (Contributor, Author)

Your benchmark shows that we need to be able to parameterize the index construction at runtime so we can experiment without recompiling. Would you be interested in tackling that bit? You can create some system variables for now and we'll use those. A quick and dirty solution, just to map with your already written benchmark code. I noticed you have some commented out SET commands that you wanted to use:

Hi @cvicentiu, I just created this PR #3226 targeting your newly created branch bb-11.4-vec-vicentiu-hugo, for configuring the parameters through system variables. Please be aware that the modification in ann-benchmark is not backward compatible, so I have created a new branch to accommodate this change.
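
For reference, a minimal plugin-style sketch of what such a tunable could look like; the variable name, bounds, and declaration style are assumptions, and the actual variables in #3226 may well be defined differently (for example as server variables):

    // Illustrative only: plugin-style system variable declaration. The name
    // and limits are placeholders, not the definitions used in PR #3226.
    #include <mysql/plugin.h>

    static unsigned int hnsw_ef_search;

    static MYSQL_SYSVAR_UINT(ef_search, hnsw_ef_search,
                             PLUGIN_VAR_RQCMDARG,
                             "Candidate list size used during HNSW search",
                             NULL, NULL,
                             10,     /* default */
                             1,      /* min */
                             10000,  /* max */
                             0);     /* block size */

    static struct st_mysql_sys_var *hnsw_sys_vars[]=
    {
      MYSQL_SYSVAR(ef_search),
      NULL
    };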

@HugoWenTD (Contributor, Author)

Updating a connection will cause a much larger write in the storage engine. We need to benchmark with longer neighbour lists (100 neighbors max for instance).

Sharing some benchmark results with different m values using DATASET=random-s-100-euclidean. The insert time and search time increased as expected, but are still acceptable, I think. I will test some more combinations with different ef_construction values.

0: MariaDB(m=100, ef_construction=10, ef_search=10)
   recall: 0.846
   QPS: 669.383
   Insert time: 252.37262177467346

1: MariaDB(m=50, ef_construction=10, ef_search=10)
   recall: 0.825
   QPS: 678.479
   Insert time: 258.6093146800995

2: MariaDB(m=10, ef_construction=10, ef_search=10)
   recall: 0.335
   QPS: 1052.397
   Insert time: 192.79994201660156
