MDEV-33408 Alter HNSW graph storage and fix memory leak #3197
base: bb-11.4-vec-vicentiu
Conversation
This commit changes the way HNSW graph information is stored in the second table. Instead of storing connections as separate records, it now stores the neighbors for each node, leading to significant performance improvements and storage savings. Compared with the previous approach, insert speed is 5 times faster, search speed improves by 23%, and storage usage is reduced by 73%, based on ann-benchmark tests with the random-xs-20-euclidean and random-s-100-euclidean datasets.

Additionally, in the previous code, vector objects were not released after use, resulting in excessive memory consumption (over 20GB for building the index with 90,000 records) and preventing tests with large datasets. The code now ensures that vectors are released appropriately during the insert and search functions. Note that there are still some vectors that need to be cleaned up after search query completion; this will be addressed in a future commit.

All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.
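A minimal sketch of the release pattern described above, assuming smart-pointer ownership of each decoded vector; the names FVector, decode_vector and search_step are hypothetical stand-ins, not the actual MariaDB code:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct FVector { std::vector<float> values; };  // hypothetical decoded vector

// Stand-in for the real "decode a vector from its stored bytes" step.
static std::unique_ptr<FVector> decode_vector(const float *data, size_t count)
{
  auto v= std::make_unique<FVector>();
  v->values.assign(data, data + count);
  return v;
}

static void search_step(const std::vector<std::vector<float>> &candidates)
{
  for (const auto &raw : candidates)
  {
    // The unique_ptr frees the decoded vector at the end of each iteration,
    // instead of leaving it allocated for the lifetime of the query or the
    // whole index build.
    std::unique_ptr<FVector> vec= decode_vector(raw.data(), raw.size());
    // ... compute the distance using vec->values ...
  }
}
```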
Assigned to @cvicentiu as it is against his branch.
Hi @HugoWenTD! Thank you for the contribution. I will address your work in order of appearance:
In any case, I have used it to double-check your findings and I can reproduce everything (except the results for this particular PR, but more on this later).
My understanding (at this point in time) is that "removing only a subset of links", as you've done in #3156, causes a soft increase in MAX_NEIGHBORS_PER_LAYER. The same effect can be achieved by just bumping the MAX_NEIGHBORS_PER_LAYER variable. So I am inclined to reject that bugfix and instead focus on tuning the graph construction parameters.
[benchmark results without any of your changes in PR #3156]
So we can achieve 90% recall by having more neighbors per layer and bumping eF_construction. Of course, the QPS drops as the graph is bigger, but as you've noticed, the code was initially written for correctness, not for speed. Once we've proven correctness, we can look into speed improvements.
Can you double-check your findings and see if you've submitted the correct code to this PR? In general I agree with your approach of using an adjacency list. This approach has some benefits for sure:
That's all the thoughts I have for now. Let me know when you've double-checked the benchmark results for your code, and in case I messed up when running it, I'll fix it.
@HugoWenTD I've done a little bit more experimenting with your bugfix from #3156. For certain combinations of parameters, it does seem to offer a marginal improvement in both QPS and recall, but I'd like to test this more systematically.
Hi @cvicentiu, Thank you for reviewing the pull requests and providing feedback on the benchmarking tool. I really appreciate your input. I'm traveling for meetings this week, but I will try to address the items you mentioned later this week or next week. You made an important point that the mMax (max neighbours) affects the update speed when using node->neighbours[]. I believe I did run tests with some bigger mMax values, but for the comparison with the previous implementation I only recorded the results of the default parameters. I will gather more data on the impact of mMax and get back to you. Regarding your comment:
Here are a few points supporting the use of one-directional links:
{
  return store_link(graph, layer_number, source_node, source_node);
  graph->field[2]->store_binary(
      reinterpret_cast<const char *>(neighbor_array), totalSize);
This is not correct. One can not just store "structs" on disk. Between compilers / architectures the structs might be padded differently. A struct is an in-memory representation that is not guaranteed to be stable on disk. For storing things persistently, one must:
- Store things byte-by-byte, explicitly.
- Ensure that the storage happens using fixed endianness. There are uintXkorr functions that can be used.
Note: this is likely a problem with my first implementation too, when reading floats from disk. We can not assume floats are stored in little-endian format. If we are to run on a big-endian machine, the data would not be portable.
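A minimal, self-contained sketch of that byte-by-byte approach; the helpers below only mirror the spirit of the server's intNstore/uintNkorr macros and are defined locally here, so pack_neighbors and friends are illustrative, not the real API:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Store a 16-bit value in fixed little-endian order, one byte at a time
// (the role int2store plays in the server).
static inline void store2_le(unsigned char *dst, uint16_t v)
{
  dst[0]= static_cast<unsigned char>(v & 0xff);
  dst[1]= static_cast<unsigned char>(v >> 8);
}

// Read it back regardless of the host's native endianness (~ uint2korr).
static inline uint16_t read2_le(const unsigned char *src)
{
  return static_cast<uint16_t>(src[0] | (src[1] << 8));
}

// Serialize a neighbor list as: 2-byte count, then count * ref_len raw
// reference bytes. No struct is ever written to disk, so compiler padding
// and host endianness cannot leak into the stored format.
static std::vector<unsigned char>
pack_neighbors(const std::vector<const unsigned char*> &refs, size_t ref_len)
{
  std::vector<unsigned char> buf(2 + refs.size() * ref_len);
  store2_le(buf.data(), static_cast<uint16_t>(refs.size()));
  for (size_t i= 0; i < refs.size(); i++)
    memcpy(buf.data() + 2 + i * ref_len, refs[i], ref_len);
  return buf;
}
```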
This is definitely a great point, Vicentiu! I did understand that this is not a portable solution for the final implementation. In this commit my main goal was to make a workable version to see if the solution makes sense; I should've added an inline comment for this TODO. Also, one aspect I wasn't very familiar with is the reference size. As you may have noticed in the code, I assumed all references of the same table have the same size (in my test the ref_size seems to be 6 bytes). I can look deeper into understanding how the references operate and maybe find a more efficient structure for storing the neighbours.
Currently in the code I'm using this structure to save the neigh_ref:
struct NeighborArray
{
uint8 num;
char neigh_ref[];
};
Do you think we need the uint8 num here to validate the number of neighbours read from the second table? Or should we simply derive the neighbour count from the size of neigh_ref divided by the ref_len obtained from the source table info?
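A small sketch of the length-based alternative, assuming the fixed reference size is known (e.g. from the source table handler's ref_length); neighbor_count is a hypothetical helper, not existing code:

```cpp
#include <cstddef>

// If the blob holds nothing but fixed-size references, the count can be
// derived from the stored length, making an explicit num field redundant.
static size_t neighbor_count(size_t stored_blob_len, size_t ref_len)
{
  // Validate that the blob length is a whole number of references; a
  // malformed length is treated here as "no neighbors" (the real code
  // could raise an error instead).
  if (ref_len == 0 || stored_blob_len % ref_len != 0)
    return 0;
  return stored_blob_len / ref_len;
}
```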
}
for (const auto &node: dest_nodes)
  // record need update
  if (cmp_record(graph, record[1]))
This compares the full record, even the "uninitialized bytes" of the varbinary column. The low recall I was experiencing with your branch got fixed when removing this line.
Nice catch! I initially had the following two lines immediately after reading the record into record[1] with ha_index_read_map. Later I was trying to optimize it to avoid storing field[2] in some scenarios, but apparently the logic of storing after the comparison is incorrect.
graph->field[2]->store_binary(
    reinterpret_cast<const char *>(neighbor_array), totalSize);
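A reduced, self-contained sketch of that intended ordering, with MockRow, rows_differ and update_needed as simplified stand-ins for the record buffers, cmp_record and the ha_update_row decision (not the actual code):

```cpp
#include <vector>

using MockRow= std::vector<unsigned char>;                   // ~ a record buffer

static bool rows_differ(const MockRow &a, const MockRow &b)  // ~ cmp_record()
{
  return a != b;
}

// old_copy holds the row as read from the index (~ ha_index_read_map into
// record[1]); current is the row being rebuilt (~ record[0]).
static bool update_needed(MockRow &current, const MockRow &old_copy,
                          const MockRow &new_neighbor_blob)
{
  // Store the new neighbor blob into the current row *before* comparing
  // (~ field[2]->store_binary into record[0]) ...
  current= new_neighbor_blob;
  // ... and only then decide whether an update (~ ha_update_row) is needed.
  return rows_differ(current, old_copy);
}
```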
Another thing regarding the recall: I tried various parameter values last week. I tested your original implementation, your implementation with my changes in this PR, and the implementation in the current PR. However, I always got a lower recall rate compared to PGVector using the same parameter sets. There could be something we overlooked or something done differently by PGVector.
Hi @cvicentiu, I just created PR #3226 targeting your newly created branch bb-11.4-vec-vicentiu-hugo, for configuring the parameters through system variables. Please be aware that the modification in ann-benchmark is not backward compatible, so I have created a new branch to accommodate this change.
Sharing some benchmark results with different parameter sets:
0: MariaDB(m=100, ef_construction=10, ef_search=10)
1: MariaDB(m=50, ef_construction=10, ef_search=10)
2: MariaDB(m=10, ef_construction=10, ef_search=10)
How can this PR be tested?
Comparing the performance with the previous method of saving connections, using the ann-benchmark tool, a significant improvement in insert/search/storage size can be seen.
To isolate the impact of the storage structure on performance, I kept the HNSW parameters consistent in the commit. The only different parameter is EF_SEARCH=10, which could help improve the recall.
(The "new" version is based on the commit in this pull request, and the "old" version is commit 00b613a2, which includes my fix in a previous pending PR.)
With dataset random-xs-20-euclidean: [t1#i#01.* data and index file size listings]
With dataset random-s-100-euclidean:
New second table data and index sizes: [t1#i#01.* file listing]
Old second table data and index sizes: [t1#i#01.* file listing]