
Use LMDB to store vectors in PAV backend #378

Open
osma opened this issue Jan 27, 2020 · 1 comment


osma commented Jan 27, 2020

Similar to #363; related to #339

When the PAV backend is trained, it sends all documents through the source projects and aggregates their suggestion vectors in memory. This can take up a significant amount of RAM - not quite as much as with nn_ensemble but still several GB in typical YSO settings.

If we instead streamed the vectors to an LMDB database and then read them back from the LMDB in batches, the backend could scale to much larger training data sets. An additional benefit is that the LMDB could be retained on disk, so that another training run could use the same documents with different hyperparameters (with the --cached option - see #342) without processing the documents again, making it much faster.

LMDB seems to be ideal for this as it is very fast and supports streaming style operations both for reading and writing. It will introduce an additional dependency, though.

@osma osma added this to the Long term milestone Jan 27, 2020

osma commented Feb 4, 2020

After switching to sparse vectors (#379) in the PAV backend, RAM usage is now much lower, so this is no longer so crucial.
