Improvements to pruning #1095
Comments
Thanks for the write-up! I've got some questions:
By "read", do you mean actually returning the content, not just accessing it? It does still need to iterate over all the elements of the database, right?
I still don't fully understand why we can't delete items until a certain amount of deleted data is reached. Couldn't you decide how many items to keep sorted based on a total size value to reach? That being said, I do think that alternatives 1 and 2 seem more useful, especially because of the more "natural" range queries.
Not fully sure, but I think the original idea of using the … Do you have any numbers/benchmarks on the queries for the alternatives (not sure which ones you tried already)? It would be nice to have a small benchmark tool with which we can keep testing these operations.
Yep, it still needs to iterate over the whole db, and it actually returns only a tuple.
Yep, you could do that, but if there are a lot of small elements in the database, it would need to read a lot of elements into memory, which could end with some kind of out-of-memory error. So you still need some kind of fixed value to constrain the maximum number of elements that end up in memory.
Currently … As discussed, alternatives …
After deleting 50MB from a 1GB database, …
I was curious as to why this takes so long, and not knowing how …
This is an issue: we can't let users provide a max storage size and then double it when we need to prune. So I believe we will have to move away from the vacuum approach. Reading about how deletion and vacuum work, I think it is fine that the database sits at … The other path would be auto-vacuum, as you have mentioned before. From that page I read that the side effects there are potentially more fragmentation, and also that it doesn't compact the partially filled pages.
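For illustration, a minimal sketch of the two compaction paths, using Python's built-in `sqlite3` (the `storage` table and its columns are made-up names, not the actual schema):

```python
import sqlite3

con = sqlite3.connect("content.db")
con.execute("CREATE TABLE IF NOT EXISTS storage (content_id BLOB PRIMARY KEY, payload BLOB)")

# Option A: manual compaction. Space freed by DELETE stays inside the file
# until VACUUM rewrites the whole database, which is the slow/expensive part.
con.execute("DELETE FROM storage WHERE content_id > ?", (b"\xf0" * 32,))
con.commit()
con.execute("VACUUM")

# Option B: incremental auto-vacuum. The pragma only takes effect on an empty
# database or after a full VACUUM; freed pages then go to a freelist and can
# be handed back in small steps, but partially filled pages are not compacted,
# so fragmentation can increase over time.
con.execute("PRAGMA auto_vacuum = INCREMENTAL")
con.execute("VACUUM")  # run once so the pragma takes effect on an existing db
con.execute("PRAGMA incremental_vacuum(100)")  # release up to 100 free pages
con.close()
```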
Each node in the portal network should have a configured maximal storage quota. When the quota is reached, the node should clean up the data which is furthest from the local node id and decrease its radius to slow down the ingestion of data.
Current Approach:
The current approach requires the node to read a pre-configured number of elements from the database, keeping them sorted in some data structure in order of distance from the local node id. Those furthest elements are then deleted, followed by the `VACUUM;` statement, which actually removes the deleted elements from the db and repackages it into minimal size.
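A rough sketch of that flow, using Python's built-in `sqlite3` as a stand-in for the actual implementation (the `storage`/`content_id` names, the `MAX_PRUNED_ELEMENTS` constant, and the radius calculation are illustrative assumptions):

```python
import heapq
import sqlite3

MAX_PRUNED_ELEMENTS = 1000  # pre-configured cap on how many ids are held in memory

def prune(con: sqlite3.Connection, local_id: int) -> int:
    """Delete the MAX_PRUNED_ELEMENTS items furthest from local_id,
    compact the file with VACUUM, and return a decreased radius."""
    furthest: list[tuple[int, bytes]] = []  # min-heap keyed by xor distance
    for (content_id,) in con.execute("SELECT content_id FROM storage"):
        distance = local_id ^ int.from_bytes(content_id, "big")
        heapq.heappush(furthest, (distance, content_id))
        if len(furthest) > MAX_PRUNED_ELEMENTS:
            heapq.heappop(furthest)  # drop the closest, keep only the furthest N

    con.executemany(
        "DELETE FROM storage WHERE content_id = ?",
        [(cid,) for _, cid in furthest],
    )
    con.commit()
    con.execute("VACUUM")  # reclaims the space, but rewrites the whole db file

    # New radius: the distance of the closest element that was just pruned.
    return min(d for d, _ in furthest) if furthest else 2**256 - 1
```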
Alternative 1:
Keep an additional column `shortId` which stores the first 8 most significant bytes of each id as a sqlite Integer. This makes it possible to use sqlite's built-in bitwise operations to construct the XOR distance as `((?1 | storage.short_content_id) - (?1 & storage.short_content_id))`, so range queries become possible:

`SELECT id WHERE xor(short_id, id) > xxxx AND xor(short_id, id) < xxxx`

where `xor` stands for `((?1 | storage.short_content_id) - (?1 & storage.short_content_id))`.

The downside is that this expression is not really readable for someone who will maintain it.
Alternative 2:
Extend the sqlite api to support adding custom functions, then add a custom distance function, `xorDistance`, to compare 32 byte blobs:

`SELECT id WHERE xorDistance(id, someId) > xxxx AND xorDistance(id, someId) < xxxx`
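Python's `sqlite3` happens to expose this kind of hook out of the box (`create_function`), so a minimal sketch of the idea could look like the following; the actual client would register the function through its own sqlite bindings, and the schema and bounds here are assumptions:

```python
import sqlite3

def xor_distance(a: bytes, b: bytes) -> bytes:
    # Return the distance as a 32-byte blob: sqlite compares blobs with memcmp,
    # so the ordering matches the unsigned 256-bit xor distance.
    return bytes(x ^ y for x, y in zip(a, b))

con = sqlite3.connect("content.db")
con.execute("CREATE TABLE IF NOT EXISTS storage (content_id BLOB PRIMARY KEY, payload BLOB)")
con.create_function("xorDistance", 2, xor_distance, deterministic=True)

local_id = bytes(32)          # illustrative local node id
lower = b"\x10" + bytes(31)   # illustrative distance bounds
upper = b"\xf0" + bytes(31)
rows = con.execute(
    "SELECT content_id FROM storage "
    "WHERE xorDistance(content_id, ?) > ? AND xorDistance(content_id, ?) < ?",
    (local_id, lower, local_id, upper),
).fetchall()
```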
Alternative 3:
Use `xor(local_node_id, content_id)` as the primary key in our table.
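A minimal sketch of what that schema could look like (assumed names, not the actual implementation); the main win is that pruning turns into a single range delete on the primary key:

```python
import sqlite3

def xor_distance(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

local_id = bytes.fromhex("ab" * 32)  # illustrative local node id

con = sqlite3.connect("content.db")
# The distance to the local node id is computed once on insert and stored as
# the primary key, so both range queries and pruning work directly on the key.
con.execute(
    "CREATE TABLE IF NOT EXISTS storage ("
    " distance BLOB PRIMARY KEY,"  # xor(local_node_id, content_id)
    " content_id BLOB,"
    " payload BLOB)"
)

def put(content_id: bytes, payload: bytes) -> None:
    con.execute(
        "INSERT OR REPLACE INTO storage VALUES (?, ?, ?)",
        (xor_distance(local_id, content_id), content_id, payload),
    )

def prune(radius: bytes) -> None:
    # Everything further away than the new radius goes in a single range delete.
    con.execute("DELETE FROM storage WHERE distance > ?", (radius,))
```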
Taking into account that … the `VACUUM;` statement …

From all the presented alternatives I prefer option 2, as it makes all of our queries possible and delegates most of the responsibilities to sqlite. Performance-wise it is not the fastest, but it should not be a problem.