When the PAV backend is trained, it sends all documents through the source projects and aggregates their suggestion vectors in memory. This can consume a significant amount of RAM - not quite as much as with nn_ensemble, but still several GB in typical YSO settings.
If we instead streamed the vectors to an LMDB database and then read them back in batches, the backend could scale to much larger training data sets. An additional benefit is that the LMDB file could be retained on disk, so another training run could be made with the same documents but different hyperparameters (using the --cached option - see #342) without having to process the documents again, which would be much faster.
LMDB seems ideal for this: it is very fast and supports streaming-style operations for both reading and writing. It would introduce an additional dependency, though.
Similar to #363; related to #339