
RFE: document why lvm2 has to revalidate the metadata for every command #74

Open · DemiMarie opened this issue Mar 25, 2022 · 16 comments

@DemiMarie

lvm2 having to revalidate metadata for every command is highly non-obvious, and not understanding the reasons behind it leads to confusion as to why lvm2’s shell mode doesn’t perform seemingly obvious optimizations.
#65 (comment) has some explanation, but I would prefer this to be in the lvm2 documentation.

@zkabelac

We have seen too many problems over the history of Linux & lvm2 development - so having this validation always ON was never a big deal. It protects us from releasing bad code, and it catches various bugs we have seen at various stages (kernel, virtualization, users providing us with invalid hand-made data...).

Of course, if the user is processing massive metadata, this could be considered as an 'optional' feature - a configurable knob controlling deep validation.

However, lvm2 was never really designed to work on multi-MiB metadata sets - so cutting off the 'validation' time is not a 'major' time saver either...

@DemiMarie
Author

Would it be possible to document this in man lvm or similar?

@zkabelac

Document what exactly - that lvm2 slows down as the metadata size increases?
The bigger the size, the slower the command execution gets - that seems expected.

But patches to enhance readability and understanding are always welcome.

@DemiMarie
Author

> Document what exactly - that lvm2 slows down as the metadata size increases? The bigger the size, the slower the command execution gets - that seems expected.

That lvm2 can’t perform the seemingly obvious optimizations I have asked about on the mailing list and on GitHub, such as caching the metadata between commands in the shell.

@zkabelac

>> Document what exactly - that lvm2 slows down as the metadata size increases? The bigger the size, the slower the command execution gets - that seems expected.
>
> That lvm2 can’t perform the seemingly obvious optimizations I have asked about on the mailing list and on GitHub, such as caching the metadata between commands in the shell.

This is not easy to answer - the historical primary goal of lvm2 has always been bullet-proof correctness - so whenever lvm2 drops the VG lock, after reacquiring the lock we always reload and validate all data directly from disk (and you probably wouldn't believe how many kinds of disk errors various VM machines have suffered from...).

So 'rapid fire' of lvm2 commands was never a primary goal - a mode where 'temporary' caching of live 'metadata' in RAM would be the 'primary' source of VG info (instead of the written content on disk).
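
As a concrete illustration of this point (a rough sketch - `vg0` and `lv1` are placeholder names): even inside lvm2's interactive shell, every command re-acquires the VG lock and re-reads and re-validates the metadata from disk; nothing is carried over between the two commands below.

```
$ lvm                            # enter the lvm2 interactive shell
lvm> lvs vg0                     # locks the VG, reads + validates metadata from disk
lvm> lvcreate -L 1G -n lv1 vg0   # the lock was dropped in between, so the
                                 # metadata is read and validated all over again
lvm> exit
```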

I could see some potential use-case, but so far there has never been any 'customer' behind such a request.
(lvm2 being a volume manager that manages 'volumes' as long-lived objects, there has never been any plan so far to work with 'short-lived' volumes, where the execution time of a command would actually be a thing to consider...)

Also, if the 'data' are not validated after being obtained from 'disk', the possible risk of big data loss gets bigger - although there can be environments where such a 'risk' could be taken in order to run commands at a much higher rate.

@DemiMarie
Author

> This is not easy to answer - the historical primary goal of lvm2 has always been bullet-proof correctness - so whenever lvm2 drops the VG lock, after reacquiring the lock we always reload and validate all data directly from disk (and you probably wouldn't believe how many kinds of disk errors various VM machines have suffered from...).

That makes a LOT of sense. I was hoping for a “never drop the lock” mode, but at that point I am not sure if dm-thin is the right choice. For instance, it has a limit of 2^24 transactions.

@zkabelac

>> This is not easy to answer - the historical primary goal of lvm2 has always been bullet-proof correctness - so whenever lvm2 drops the VG lock, after reacquiring the lock we always reload and validate all data directly from disk (and you probably wouldn't believe how many kinds of disk errors various VM machines have suffered from...).
>
> That makes a LOT of sense. I was hoping for a “never drop the lock” mode, but at that point I am not sure if dm-thin is the right choice. For instance, it has a limit of 2^24 transactions.

Certainly, if you do plan to write a disk management system to control a Universe, it's a seriously limiting factor, but ATM we have never faced any problems with this limitation in any real-life customer case.

Surely, for a higher-scale range of transactions you would need to find some other product (although I admit I'd be curious to know about them myself, and how they compare in performance).

Also note - there are other limiting factors, like fragmentation of data and metadata with the current thin-pool format v1.5, if you plan to make such a massive deployment (some of them are going to be addressed in the upcoming v2.0).

@DemiMarie
Author

> (some of them are going to be addressed in the upcoming v2.0)

Has there been any work on this that I can see?

@zkabelac

The best I can advise is to look at: https://github.com/jthornber

But it will still take a couple of months before this gets fully upstream...

@DemiMarie
Author

> The best I can advise is to look at: https://github.com/jthornber
>
> But it will still take a couple of months before this gets fully upstream...

I see. Will that fix QubesOS/qubes-issues#3244?

@zkabelac

Well, in the first place, I do not exactly understand what your exact problem with the 24-bit value is.
This snapshot (or rather thin-volume) id could easily be wrapped around - it has no meaning other than identifying things. So unless you dream of using 2^24 thin volumes at the same time from a single thin-pool, I don't see a problem.

Of course, the size of the thin-pool metadata is a much bigger problem - that limitation should improve with the newer format, as should support for chunk sizes smaller than 64k.

@DemiMarie
Author

> Well, in the first place, I do not exactly understand what your exact problem with the 24-bit value is.
> This snapshot (or rather thin-volume) id could easily be wrapped around - it has no meaning other than identifying things. So unless you dream of using 2^24 thin volumes at the same time from a single thin-pool, I don't see a problem.

According to the reporter, there was an unchecked overflow, which would result in other data being corrupted. Also, what if thin 0 was still in use when the counter wrapped? It’s not too hard to imagine this situation, since often the first thin is the root filesystem, which will always be in use.

@DemiMarie
Author

DemiMarie commented Mar 31, 2022

> Of course, the size of the thin-pool metadata is a much bigger problem - that limitation should improve with the newer format, as should support for chunk sizes smaller than 64k.

This will be very helpful for Qubes OS. My suspicion is that Qubes OS is very much a sharing-breaking-intensive workload, especially with a CoW filesystem inside the guest. Right now, initializing writes seem to be slower than writes to preallocated space by a factor of 2 or more in a trivial dd benchmark.

@zkabelac

Each thin volume has its DeviceID (24-bit) - but that's about it. It keeps the list of associated chunks - a snapshot is just like any other thin LV; the difference between a 'regular' thin LV and a snapshot thin LV is that the latter starts with pre-populated mappings.

Of course, it's up to your user-space app to ensure you are not 'using' an existing DeviceId when you are making a new thinLV - but as said, it's unsupported to have >2^24 thinLVs in a single thin-pool; if that's what you want, you need to seek another solution.
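
As a rough sketch of what that user-space bookkeeping could look like (purely illustrative shell, not lvm2 code - `last_id` and the `in_use` helper are hypothetical):

```
MAX_ID=$(( (1 << 24) - 1 ))            # device ids are 24-bit: 0..16777215
next_id=$(( (last_id + 1) & MAX_ID ))  # wrap around past 2^24 - 1
while in_use "$next_id"; do            # 'in_use' is a hypothetical check for live ids
    next_id=$(( (next_id + 1) & MAX_ID ))
done
```

The wrap itself is harmless as long as the allocator skips ids that still belong to live thin volumes.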

And yes - block provisioning IS expensive; the added value of thin-pool usage does not come for free...

Also - if you are focused on 'dd' benchmarking - setting a proper buffer size and using direct writes are all that matter.
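
For instance, a pair of runs along these lines (the device path is a placeholder) separates provisioning cost from rewrite cost - the first pass allocates chunks, the second rewrites already-provisioned space:

```
# First pass: chunks are provisioned on demand (slower).
dd if=/dev/zero of=/dev/vg0/thin1 bs=1M count=1024 oflag=direct conv=fsync
# Second pass: the same blocks are already allocated (faster).
dd if=/dev/zero of=/dev/vg0/thin1 bs=1M count=1024 oflag=direct conv=fsync
```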
And also - the bigger the chunk, the better the performance gets (but with less efficient snapshotting).
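
The chunk size is chosen when the thin-pool is created, e.g. via lvcreate's `--chunksize` option (the names below are placeholders):

```
# Larger chunks favor provisioning throughput,
# smaller chunks favor snapshot space efficiency.
lvcreate --type thin-pool -L 100G --chunksize 256K -n pool0 vg0
```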

@DemiMarie
Author

> Each thin volume has its DeviceID (24-bit) - but that's about it. It keeps the list of associated chunks - a snapshot is just like any other thin LV; the difference between a 'regular' thin LV and a snapshot thin LV is that the latter starts with pre-populated mappings.

The DeviceID isn’t what I am referring to. I am referring to the timestamp that dm-thin uses internally to know if it needs to break sharing.

@zkabelac

Each thin LV has a 'mapping' B-tree - as soon as it needs to 'write' to any of its chunks, it needs to own that chunk exclusively - not sure how the timestamp may affect this...
