Improved block serialization format #3946

Closed
ShadowNinja opened this issue Apr 1, 2016 · 9 comments

@ShadowNinja
Member

ShadowNinja commented Apr 1, 2016

Recently, while writing a utility that serializes MapBlocks and writes them to a database, I noticed that the format has some room for improvement:

  • Compress the whole block at once, instead of separately compressing all the large sections. This is simpler and may improve performance and increase compression ratios.
  • Reorder node storage to YZX or YXZ order to improve compression (blocks are likely to have similar nodes in the same Y layer); see the indexing sketch after this list.
  • Add a binary Inventory list serialization format. Currently node metadata inventories are serialized in the same multi-line text format used in player files. It's compressed, but even an uncompressed binary format is probably smaller than the compressed text. Proposed format (a serialization sketch follows this list):
    • u8 version = 0
    • u8 flags { FLAG_EMPTY } Perhaps combine with version? Empty inventories are quite common, so this flag is probably worth having. Alternatively, have separate size and "included size" fields -- or store items as a Map<SlotID, Item>.
    • u8 width
    • u16 size
    • for each item (if not empty):
      • u16 count (if 0 continue)
      • u8/u16 name_size
      • char name[name_size]
  • Split content array? Currently it's just serialized as an array of u16. However, that allows up to 2^16 node IDs per block, while only 2^12 different IDs are actually possible, wasting at least 4 bits. Also, most blocks won't have anywhere near 2^12 different nodes; many will have fewer than 8. To reduce this waste, the content could be serialized as an array of u8, with an optional array of u4 (conceptually) included if there are more than 256 distinct nodes in a block. This might be pointless if compression does a good job, though.
  • Omit IDs from name-id-map: The IDs are just serialized as 0, 1, 2, etc, so just serialize an array of strings and use the index as the ID.
  • Consider better compression levels? (or use a setting?)
  • Remove content_width/params_width: these are set to constants and ignored. Changing them would break the format anyway.
  • Don't include timer data length if version is 0 (in fact, perhaps it should be completely removed).
  • Use bit shifts instead of multiplication for meta/timer positions (simpler and more efficient; also shown in the indexing sketch below).
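
A minimal sketch of the proposed inventory list format, choosing a u8 name_size. The Item struct and the write helpers are hypothetical stand-ins, not Minetest's actual API; the field layout follows the proposal above:

```cpp
#include <cstdint>
#include <ostream>
#include <string>
#include <vector>

struct Item {
    std::string name;
    uint16_t count = 0; // 0 = empty slot
};

static void writeU8(std::ostream &os, uint8_t v)   { os.put(static_cast<char>(v)); }
static void writeU16(std::ostream &os, uint16_t v) { // big-endian
    os.put(static_cast<char>(v >> 8));
    os.put(static_cast<char>(v & 0xFF));
}

enum InvFlags : uint8_t { FLAG_EMPTY = 0x01 };

void serializeInvList(std::ostream &os, const std::vector<Item> &items, uint8_t width)
{
    bool empty = true;
    for (const Item &it : items)
        if (it.count != 0) { empty = false; break; }

    writeU8(os, 0);                                    // u8 version = 0
    writeU8(os, empty ? FLAG_EMPTY : 0);               // u8 flags
    writeU8(os, width);                                // u8 width
    writeU16(os, static_cast<uint16_t>(items.size())); // u16 size

    if (empty)
        return; // item payload omitted entirely for empty inventories

    for (const Item &it : items) {
        writeU16(os, it.count);                        // u16 count
        if (it.count == 0)
            continue;                                  // "if 0 continue": no name follows
        writeU8(os, static_cast<uint8_t>(it.name.size()));   // u8 name_size
        os.write(it.name.data(), it.name.size());            // char name[name_size]
    }
}
```
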
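And a small sketch of the YZX reordering and the shift-based indexing, assuming the current layout is ZYX with X varying fastest and the usual 16×16×16 block (4 bits per axis):

```cpp
#include <cstdint>

constexpr uint32_t BS = 16; // MAP_BLOCKSIZE

// Current order (assumed ZYX, x fastest), with multiplications:
inline uint32_t indexZYX(uint32_t x, uint32_t y, uint32_t z) {
    return (z * BS + y) * BS + x;
}

// Proposed YZX order: nodes in the same Y layer become adjacent,
// which should group similar nodes (ground, air) for the compressor.
// Shifts replace the multiplications since BS is a power of two:
inline uint32_t indexYZX(uint32_t x, uint32_t y, uint32_t z) {
    return (y << 8) | (z << 4) | x; // == (y*16 + z)*16 + x
}
```
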
@kwolekr
Contributor

kwolekr commented Apr 1, 2016

Sure, that all sounds good, but this is a longer-term enhancement that should probably come below getting other PRs merged in terms of priority.

@est31
Contributor

est31 commented Apr 1, 2016

  1. Compressing the whole block at once: I'm not sure this has the desired effect. It's better to have separate Huffman trees for separate kinds of data.
  2. Reorder coordinates to YZX/YXZ: makes sense, but first test whether it actually improves the ratio.
  3. Add a binary Inventory list serialization format: good idea, 👍 . We should, however, still use the textual format for the player files, to keep them human-editable.
  4. Split content array: we would be doing the work the Huffman tree already does (or is supposed to do); I doubt it adds any improvement.
  5. Omit IDs from name-id-map: I'm not sure, but doesn't the current code write the name-id-map directly from what it gets from the nodedef, not replacing the IDs with 0, 1, 2, 3, etc., but keeping the higher ones? If we omit the IDs, then we must use 0, 1, 2, 3, etc., which means an extra walk over the param0 array to translate the IDs to that format. What about throwing out the whole name-id-map-per-block concept and making it a name-id-map-per-world one? Or multiple name-id-maps, with each block referencing one by ID; that would help if you added a mod with lots of nodes, then switched to another mod that also adds lots of nodes, and the two node sets don't fit into one name-id map.
  6. Consider better compression levels: Minetest is supposed to be a real-time game, so compression/decompression needs to be fast. I don't think anybody wants this.

@ShadowNinja
Member Author

@est31:

  1. Will have to test, then.
  2. Well, MC changed to YZX for this same reason, so it seems like it does, but I could test it.
  3. Yes, only meant for MapBlocks (although other places where they're sent over the network could benefit).
  4. Actually, this is already implemented as content_width, it's just always set to 2 even when only 1 is needed. Changing this should be rather trivial.
  5. getBlockNodeIdMapping generates a list specific to the MapBlock (it uses a global, but resets it on each call).
  6. If you have plenty of storage but saving is taking too long, you could also reduce/remove compression with a setting, but I'm pretty sure serialization runs infrequently in a background thread, and performance isn't much of an issue.

EDIT: getBlockNodeIdMapping generates a separate mapping per block.

@o-jasper

o-jasper commented Apr 8, 2016

I think you should consider merkle-tree-like structures first. Basically, the hashes of subsets of the map/data are hashed together in a tree. If a branch has the wrong hash, you ask for the sub-branch hashes, and at least one of those must be wrong too. You go down, finding all the changes, up to some granularity that is convenient and efficient.
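
A toy sketch of that descent (all names hypothetical; the mixing function stands in for a real hash, and both trees are assumed to have the same shape):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash = uint64_t;

// Toy mixing function standing in for a real hash (FNV, SHA, ...):
static Hash combine(Hash a, Hash b) {
    return a * 1099511628211ULL ^ (b + 0x9e3779b97f4a7c15ULL);
}

struct HashNode {
    Hash hash = 0;
    std::vector<HashNode> children; // empty for leaves (sub-chunks)
};

// Build a parent node from already-hashed children.
HashNode makeParent(std::vector<HashNode> children) {
    HashNode parent{0, std::move(children)};
    for (const HashNode &c : parent.children)
        parent.hash = combine(parent.hash, c.hash);
    return parent;
}

// Collect mismatching leaves between a local and a remote tree.
// Only branches whose hashes differ are descended into, so a mostly
// unchanged map needs very few comparisons.
void findChanged(const HashNode &local, const HashNode &remote,
                 std::vector<const HashNode *> &changed)
{
    if (local.hash == remote.hash)
        return; // whole subtree identical; skip it
    if (local.children.empty()) {
        changed.push_back(&local); // changed leaf: resend this sub-chunk
        return;
    }
    for (size_t i = 0; i < local.children.size(); ++i)
        findChanged(local.children[i], remote.children[i], changed);
}
```
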

You could prioritize keeping/not keeping chunks on the hard disk by how long the player spends near those areas. Then, when they log in, very little has to be sent.

(Edit: as a next step.) In case of changes nearby, you could try a synchronized simulation, sending events from users/the server across and using the hashes to catch when the synchronization fails. Failures could be due to floating-point differences or "edge of what users simulate" differences. (If floating-point values diverge in small amounts too quickly, the hash could use floor(float/small_value), allowing the small differences. Of course, depending on the case, the difference would likely grow.)

Lua tables/object data can also be hashed that way. I have something sort of like that, but not quite.

Edit: the simplest approach is to calculate a hash with each chunk. The main disadvantage is that you have to iterate over and hash the whole thing for each change, so it is likely better to have a bunch of sub-chunks with their own hashes: if one block changes, that's 4³ blocks + 4³ sub-hashes = 128 items instead of 16³ = 4096 blocks. You can also hash only occasionally, for instance waiting 5 ms after changes start before hashing, so that not every change triggers a rehash.

Also note that users can place whatever nodes they want, so malicious ones could try to find collisions that cause desynchronizations. Probably not an issue, though. Also, springrts had desync problems until it had this; at least, that's how I understand it.

@ShadowNinja
Member Author

@o-jasper: What are you talking about? It doesn't sound like what you're saying has anything to do with MapBlock serialization.

@o-jasper

Not if you're serializing to disk, but it is relevant if you're serializing to send across the network to players. It might be premature optimization, though?

@HybridDog
Contributor

The mapblock data is similar to a 3D image with a colour palette, so I think compression algorithms for image formats could be used for mapblock compression, too.
The FLIF image format has been superseded by FUIF, which has in turn been superseded by JPEG XL.
The file-size-relevant part of the Minetest map corresponds to a 3D image with a colour palette (the different nodes) and three channels (nodes, param1, param2). I think Meta-Adaptive Near-zero Integer Arithmetic Coding (MANIAC) performs significantly better for indexed images than gzip compression. Someone could have a look at the relevant parts of the JPEG XL code (lossless compression, indexed mode) and then write an efficient compression algorithm for Minetest maps.

@rubenwardy
Member

This should either be accepted and implemented, or closed.

@sfan5
Member

sfan5 commented May 5, 2021

> Remove content_width/params_width: these are set to constants and ignored. Changing them would break the format anyway.

This isn't true, btw.
It is perfectly possible to write a valid block that can be successfully/correctly decoded with content_width = 1 and the latest serialization version. Because this trick could help with compression (not in MT itself, but in external tools), I don't think it hurts to keep.
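
A rough sketch of what such an external tool might do, deciding content_width per block. The write helpers are hypothetical, and the rest of the block format (version, flags, param1/param2, compression) is omitted:

```cpp
#include <cstdint>
#include <ostream>

static void writeU8(std::ostream &os, uint8_t v)   { os.put(static_cast<char>(v)); }
static void writeU16(std::ostream &os, uint16_t v) { // big-endian
    os.put(static_cast<char>(v >> 8));
    os.put(static_cast<char>(v & 0xFF));
}

// Emit param0 with content_width = 1 when every node ID in the
// block fits in one byte, otherwise fall back to the usual 2.
void writeParam0(std::ostream &os, const uint16_t (&param0)[4096])
{
    uint16_t maxId = 0;
    for (uint16_t id : param0)
        if (id > maxId)
            maxId = id;

    const uint8_t content_width = (maxId <= 0xFF) ? 1 : 2;
    writeU8(os, content_width);

    // param0 payload: content_width bytes per node.
    for (uint16_t id : param0) {
        if (content_width == 1)
            writeU8(os, static_cast<uint8_t>(id)); // halves the pre-compression size
        else
            writeU16(os, id);
    }
}
```
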
