
Use LMDB to store vectors in PAV backend #378

Open
osma opened this issue Jan 27, 2020 · 1 comment


osma commented Jan 27, 2020

Similar to #363; related to #339

When the PAV backend is trained, it sends all documents through the source projects and aggregates their suggestion vectors in memory. This can take up a significant amount of RAM - not quite as much as with nn_ensemble but still several GB in typical YSO settings.

If we instead streamed the vectors to an LMDB database and then read them back from the LMDB in batches, the backend could scale to much larger training data sets. An additional benefit is that the LMDB could be retained on disk, so that another training run could use the same documents with different hyperparameters (with the --cached option - see #342) without processing the documents again, making it much faster.

LMDB seems to be ideal for this as it is very fast and supports streaming style operations both for reading and writing. It will introduce an additional dependency, though.

@osma osma added this to the Long term milestone Jan 27, 2020

osma commented Feb 4, 2020

After switching to sparse vectors (#379) in the PAV backend, RAM usage is now much lower, so this is no longer so crucial.
