Why/when does np.something remove the mask of a np.ma array ? #18675

Open
jypeter opened this issue Mar 24, 2021 · 6 comments · May be fixed by #22913

jypeter commented Mar 24, 2021

This is obviously more a feature than a bug, otherwise it would have been corrected (I'm using numpy 1.20.1). But it has been bothering me for a very long while, and been the indirect source of several bugs in my (and other colleagues') scripts

There must be some logic behind it, but I have not found it in the documentation. The closest issue I have found is #8881 (an open issue from 4 years ago)

I have a masked array. If I work on it with np.ma functions, things will be fine, but the equivalent function straight from np will silently ignore and remove the mask!

The example below is with hstack, but I get the same problem with vstack, repeat, and probably many other numpy functions

$ conda list | grep numpy
numpy                     1.20.1           py38h18fd61f_0    conda-forge
numpydoc                  1.1.0                      py_1    conda-forge

$ python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.ma.arange(4)
>>> a[2] = np.ma.masked

>>> a
masked_array(data=[0, 1, --, 3],
             mask=[False, False,  True, False],
       fill_value=999999)

>>> b_ma = np.ma.hstack((a, a))
>>> b_ma
masked_array(data=[0, 1, --, 3, 0, 1, --, 3],
             mask=[False, False,  True, False, False, False,  True, False],
       fill_value=999999)

>>> b_NOma = np.hstack((a, a))
>>> b_NOma
masked_array(data=[0, 1, 2, 3, 0, 1, 2, 3],
             mask=False,
       fill_value=999999)
>>>

On the other hand, some functions fortunately use and keep the mask, regardless of being taken from np or np.ma

>>> np.exp(b_ma)
masked_array(data=[1.0, 2.718281828459045, --, 20.085536923187668, 1.0,
                   2.718281828459045, --, 20.085536923187668],
             mask=[False, False,  True, False, False, False,  True, False],
       fill_value=999999)
>>> np.ma.exp(b_ma)
masked_array(data=[1.0, 2.718281828459045, --, 20.085536923187668, 1.0,
                   2.718281828459045, --, 20.085536923187668],
             mask=[False, False,  True, False, False, False,  True, False],
       fill_value=999999)

So, is this a bug or a feature? What is the logic (so that I can tell our students), and where is it clearly (for beginners) explained?

Even if this is not a bug, I think it would be much safer if numpy functions working on masked arrays always used the mask and returned a masked array

jypeter changed the title from "Why/when does np.something remove the mask of a np.ma.array ?" to "Why/when does np.something remove the mask of a np.ma array ?" Mar 24, 2021
Member

seberg commented Mar 24, 2021

This is obviously more a feature than a bug, otherwise it would have been corrected (I'm using numpy 1.20.1).

I would not consider it either a feature or a bug... But there is a fundamental problem and unfortunately there is no solution for "fixing" masked arrays. Masked arrays work well for many things, but the limitations are of course very clear to you. Maybe the point is that MaskedArrays are not considered part of the "core" of NumPy due to these problems, so the NumPy functions don't come with a guarantee to work with masked arrays.

There is currently consensus that these issues cannot be fixed in NumPy. The first thought might be to fix MaskedArray itself. But that would be a huge undertaking that is bound to break backward compatibility. The other idea was adding proper "masked" support to the core of NumPy, but that is a huge project and has its own problems as well (this was attempted a decade ago but ultimately failed for various reasons).

The good news is that we are in a pretty good position right now for a "better" MaskedArray! Something that was not possible when it was first written.
But such a project does not require implementation in NumPy itself. So the current consensus (even mentioned briefly on the roadmap) is that it should be a project started outside of NumPy proper. (I would be happy to host it on the NumPy organization if that helps push it forward.)

There are NumPy core-devs interested in such a project, for example @ahaldane. I am not sure how far along they are or if you are interested in contributing/looking at it.


As for what can be done in NumPy more concretely, and to explain the current "state": many NumPy functions call np.asarray, which effectively drops the mask. There is not always a simple logic to it; in some cases a function might just need to be changed from np.asarray to np.asanyarray to work.
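The difference between the two conversion functions can be seen directly. This is a minimal sketch using the same array as in the report above; the variable names `plain` and `kept` are only for illustration:

```python
import numpy as np

a = np.ma.arange(4)
a[2] = np.ma.masked

# np.asarray converts to a base ndarray, so the mask is lost
plain = np.asarray(a)
# np.asanyarray passes ndarray subclasses through, so the mask survives
kept = np.asanyarray(a)

print(type(plain).__name__)   # ndarray
print(type(kept).__name__)    # MaskedArray
print(plain[2])               # 2 (the value that was hidden by the mask)
print(kept.mask[2])           # True
```

Any function that calls np.asarray on its inputs will see `plain`, not `kept`, which is why the masked value reappears in its result.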

One set of functions that I think always works is the NumPy ufuncs/math functions (those for which isinstance(function, np.ufunc) is true). But generally, you are only safe if you use the np.ma.* functions.
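As a rough way to apply this rule of thumb, one can check whether a given function is a ufunc before trusting it with a masked array. The selection of functions below is arbitrary and only meant as an illustration:

```python
import numpy as np

a = np.ma.arange(4)
a[2] = np.ma.masked

# Label -> function; ufuncs are expected to keep the mask
candidates = {
    "np.exp": np.exp,
    "np.sqrt": np.sqrt,
    "np.hstack": np.hstack,
    "np.copy": np.copy,
}
is_ufunc = {name: isinstance(func, np.ufunc) for name, func in candidates.items()}
for name, flag in is_ufunc.items():
    print(name, flag)
# np.exp True
# np.sqrt True
# np.hstack False
# np.copy False

# A ufunc applied to a masked array returns a masked array with the same mask
result = np.exp(a)
print(type(result).__name__)  # MaskedArray
print(result.mask.tolist())   # [False, False, True, False]
```

A False here does not prove that the mask will be dropped, only that the ufunc guarantee does not apply and the np.ma equivalent is the safer choice.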

The only thing I could think of to actually improve the situation would be to tag on a warning when something calls np.asarray(masked_array), but even that might be tricky and probably not quite sufficient...
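Until something like this exists in NumPy, a similar check can be approximated in user code. The sketch below assumes a hypothetical helper named `asarray_warn_on_mask`; it is not part of NumPy and only covers conversions that go through it explicitly:

```python
import warnings
import numpy as np

def asarray_warn_on_mask(x):
    """Hypothetical helper (not a NumPy API): np.asarray plus a warning
    when the input is a MaskedArray that actually has masked values."""
    if isinstance(x, np.ma.MaskedArray) and np.ma.is_masked(x):
        warnings.warn(
            "converting a MaskedArray with masked values to ndarray; "
            "the mask is discarded",
            UserWarning,
            stacklevel=2,
        )
    return np.asarray(x)

a = np.ma.arange(4)
a[2] = np.ma.masked

# Capture the warning so the example shows that it is emitted
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    plain = asarray_warn_on_mask(a)

print(len(caught))           # 1
print(type(plain).__name__)  # ndarray
```

This does nothing for calls to np.asarray made inside NumPy functions, which is the harder part of the problem described above.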

Author

jypeter commented Mar 30, 2021

Thank you for the clarification! I thought masked arrays were more tightly integrated in numpy than that. And we are probably all convinced that it is better to handle missing values explicitly than to rely on NaN

Seen from outside (my user point of view), it seems that we are almost there. We should be working seamlessly with np and np.ma data, which is the case most of the time, but the side effects we get when masked data is suddenly cast back to non masked data can be both surprising and dangerous (if they go unnoticed)

As a regular user of np and np.ma, I can now spot errors when they seem to come from a lost mask, but it's not always easy

I hope something gets done about this at some point. In the meantime, mentioning it (without having it seem like a bug) in all the appropriate places in the documentation may help, if people read it. Or something that can be found with Google.

Clearly mentioning masked arrays in the NumPy user guide could help

Also, adding a warning when np.asarray or similar functions silently drop the mask would create a much safer world!

@cooperrc
Member

@jypeter, for some suggested action items here, it would be great if you could share one of your use cases with the explanation included above. Your experience would help many new and seasoned NumPy users. Would you be willing to add either:

A longer endeavor would be the suggested warnings when asarray ignores the mask. I think a quicker solution is to point new numpy.ma users to a "How-to" or "tutorial".


matteo-pallini commented Sep 24, 2021

Thanks for the explanations.

From a user perspective I also think that it would be much safer to have a warning whenever np.asarray is called on a masked array.

I ended up getting burned by using np.copy rather than np.ma.copy. It took me a while to figure out where the issue was; I think a warning would have sped up that process
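For reference, the np.copy pitfall looks roughly like this. This is a minimal sketch with the same array as in the original report, relying on np.copy's default of not preserving subclasses:

```python
import numpy as np

a = np.ma.arange(4)
a[2] = np.ma.masked

c_np = np.copy(a)        # plain ndarray, mask dropped
c_ma = np.ma.copy(a)     # MaskedArray, mask kept
c_method = a.copy()      # the method also keeps the mask

print(type(c_np).__name__)      # ndarray
print(c_np.tolist())            # [0, 1, 2, 3]
print(type(c_ma).__name__)      # MaskedArray
print(c_ma.mask.tolist())       # [False, False, True, False]
print(c_method.mask.tolist())   # [False, False, True, False]
```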

Member

rgommers commented Mar 5, 2023

The only thing I could think of to actually improve the situation would be to tag on a warning when something calls np.asarray(masked_array), but even that might be tricky and probably not quite sufficient...

This is actually quite a footgun, so maybe we should just deprecate and then remove the implicit conversion? When users know what they are doing, they can use the .data and .mask attributes.
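Explicit conversion along these lines might look like the sketch below. The fill value of -1 is an arbitrary choice for the example, and np.ma.getmaskarray is used instead of .mask so that the mask is always a full boolean array:

```python
import numpy as np

a = np.ma.arange(4)
a[2] = np.ma.masked

values = a.data                    # underlying ndarray, masked slots included
mask = np.ma.getmaskarray(a)       # boolean ndarray with the same shape as a
filled = a.filled(-1)              # ndarray with masked slots replaced by -1
valid = a.compressed()             # 1-D ndarray of the unmasked values only

print(values.tolist())   # [0, 1, 2, 3]
print(mask.tolist())     # [False, False, True, False]
print(filled.tolist())   # [0, 1, -1, 3]
print(valid.tolist())    # [0, 1, 3]
```

Each of these makes the decision about the masked entries visible in the code, rather than leaving it to whatever a function does internally.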

rgommers added this to the 1.25.0 release milestone Mar 5, 2023
Author

jypeter commented Mar 7, 2023

If only users knew what they were doing and read the documentation!

We have to make sure that things work implicitly as expected (in our case, masks are used and carried on, if present), or that there are builtin safeguards (i.e. warning messages that will help the users get what they want). Things should be bulletproof (idiot-proof, for some of our lazy users...)

But I'm not the person who can do the required coding

As I said, masks (and masked arrays) are a great feature (I hate NaNs) when they work as expected
