
lvremove pitfall/user experience #34

Open
dmlary opened this issue Jun 17, 2020 · 8 comments

@dmlary commented Jun 17, 2020:

This may not be a common experience, but I want to share it, along with a couple of simple changes that would make lvremove a little less terrifying to use.

I rarely have to remove a logical volume, but every time it happens I know I'm going to fall into this trap. First of all, I never remember the arguments, so I check lvremove -h. That output gives an arbitrary VG|LV|Tag|Select ... as the first argument, with no clarification as to what the format of any of these is. There is --longhelp, but it just redirects you to a manpage. So invariably I try to run this command (an argument syntax that matches lvcreate):

lvremove my_vg my_lv

This command is absolutely terrifying to run the first time, because it will start telling you about every LV it cannot delete because it is in use. On top of that, SIGINT is being trapped, so you cannot use CTRL-C to abort the operation. Run this on a system with dozens of LVs, and you're stuck opening another terminal to send a SIGKILL to get control back.

My request here is for two changes:

  • Do not trap SIGINT until the code actually needs to be uninterruptable
  • Add the most common use case(s) of lvremove as an example to the end of the lvremove --help output

These two changes are minor, and they will take years to reach the environment I work in, but they could make another user's life a little more pleasant.

@bmr-cymru (Contributor) commented:

The way I remember this is that an LV name on its own is meaningless: it must always be qualified by a VG name. In LVM2 the canonical way of referring to an LV is always vg_name/lv_name. This is the way the LVM tools express the LV to you (and it reflects the directory structure of the LV symlinks in /dev). This applies to LV administration commands like lvresize, lvchange, lvremove, etc., as well as reporting and information tools like lvs and lvdisplay. The LV in the tool --help output refers to this vg_name/lv_name notation (among other forms, but those are not the canonical one).

If you need an online reminder then the man page for lvremove includes one in the EXAMPLES section:

EXAMPLES
       Remove an active LV without asking for confirmation.
       lvremove -f vg00/lvol1

       Remove all LVs in the specified VG.
       lvremove vg00

I agree the tool --help output could be a bit more informative here: a lot of work was done in recent years to make this more consistent and easier to maintain across the whole set of LVM2 commands, but it's always a difficult balance to get right.

@bmr-cymru (Contributor) commented:

I think it would also be helpful if the lvremove vg_name lv_name mis-form produced an immediate error before trying to do anything, since the unqualified lv_name is not a valid LVM2 name.

I would be a bit cautious about SIGKILLing LVM2 processes - they hold locks while operating and may in some cases carry out repairs or other modifications to the device metadata (as well as the operation specified on the command line). It's unlikely but the wrong timing could leave your devices in an inconsistent state.

@zkabelac commented:

If there were to be any change like that, it would have to be some sort of 'extra' trap for the 'missing /' case - since in some commands lvm2 may 'deduce' an unspecified VG name if it was given with some other LV (and there is also the environment variable LVM_VG_NAME, which can supply the VG name when it is missing).

So technically lvm2 would have to do this analysis before starting the command - it would figure out that no VG with the name lv_name exists, and could probably exit early.

@DemiMarie commented:

> I would be a bit cautious about SIGKILLing LVM2 processes - they hold locks while operating and may in some cases carry out repairs or other modifications to the device metadata (as well as the operation specified on the command line). It's unlikely, but the wrong timing could leave your devices in an inconsistent state.

If SIGKILL can leave the devices in an inconsistent state, then so can a system crash, which is bad as it means that lvm2 is not crash-safe.

@zkabelac commented:

> > I would be a bit cautious about SIGKILLing LVM2 processes - they hold locks while operating and may in some cases carry out repairs or other modifications to the device metadata (as well as the operation specified on the command line). It's unlikely, but the wrong timing could leave your devices in an inconsistent state.
>
> If SIGKILL can leave the devices in an inconsistent state, then so can a system crash, which is bad as it means that lvm2 is not crash-safe.

I'm unsure what you mean by a 'crash-safe' app here?

lvm2 normally suspends signal handling for critical-section operations - so we protect against SIGINT, SIGTERM, and other regular signals.

Clearly NO user-space app can protect itself against SIGKILL (by Linux kernel definition).

So if you kill an lvm2 command within critical-section code processing, you can generally kill your system - e.g. if lvm2 has suspended your rootfs volume, there is nothing left to resume it since lvm2 has been killed.

So yes, an lvm2 command can stop your working system, and users should never need to SIGKILL it - we do our best to avoid that - but clearly we are still 'just a user-space app'...

@DemiMarie commented:

> So if you kill an lvm2 command within critical-section code processing, you can generally kill your system - e.g. if lvm2 has suspended your rootfs volume, there is nothing left to resume it since lvm2 has been killed.

In this case I would argue that this is a kernel bug, inasmuch as killing any process other than PID 1 should not leave the system in an unrecoverable state.

@zkabelac commented:

> > So if you kill an lvm2 command within critical-section code processing, you can generally kill your system - e.g. if lvm2 has suspended your rootfs volume, there is nothing left to resume it since lvm2 has been killed.
>
> In this case I would argue that this is a kernel bug, inasmuch as killing any process other than PID 1 should not leave the system in an unrecoverable state.

I'm not going to argue about Linux architecture - it's the way it is...

Users should not be 'randomly' killing their processes and then wondering why their system got frozen. It's much the same as arguing that a user should not be able to destroy their system with 'rm -rf /*'...

lvm2 is a highly privileged process, and as such it can kill your machine (roughly equivalent to a kernel task).

A normal user-space app typically does not have such ability.

@DemiMarie commented:

@zkabelac Okay, valid. That makes lvm2 the equivalent of a storage daemon needed by the rootfs.
