
Dask data manager #2461

Merged
merged 14 commits into SciTools:dask from dask-data-manager on Mar 30, 2017

Conversation

@bjlittle (Member) commented Mar 28, 2017:

Introduces the concept of a generic data manager with a well-defined API and a strictly enforced contract.

The purpose and intent of the data manager is to abstract and contain the state and behaviour required to manage a generic Iris data payload, be it for a cube, coordinate points, coordinate bounds, cell measure payload, and so on.

This is a WIP at the moment ...

  • Complete XXX doc-strings
  • Unit test coverage for iris._data_manager.DataManager._deepcopy method
  • Remove lib/iris/manager.txt file
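As a rough sketch of the concept (a toy class, not the actual iris._data_manager.DataManager; detecting a lazy payload via compute() is an assumption for illustration), the manager holds exactly one of a real or lazy payload and enforces that contract:

```python
import numpy as np


class MiniDataManager:
    """Toy manager holding exactly one of a real or a lazy payload."""

    def __init__(self, data):
        self._real_array = None
        self._lazy_array = None
        # Treat anything exposing compute() as a lazy (dask-like) array;
        # everything else is realised immediately.
        if hasattr(data, 'compute'):
            self._lazy_array = data
        else:
            self._real_array = np.asanyarray(data)
        self._assert_axioms()

    def _assert_axioms(self):
        # Contract: exactly one payload form is set at any time.
        is_lazy = self._lazy_array is not None
        is_real = self._real_array is not None
        assert is_lazy ^ is_real, 'Unexpected data manager state.'

    def has_lazy_data(self):
        return self._lazy_array is not None


dm = MiniDataManager([1, 2, 3])
print(dm.has_lazy_data())  # False: a plain sequence is realised up front
```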

@bjlittle (Member, Author):

Closes #2386

@lbdreyer (Member) left a comment:

Absolute monster of a PR!!!

I've added some comments but this is looking very good!

I want to go over a few things tomorrow but I thought I'd submit this review in the meantime.

if not (hasattr(data, 'shape') and hasattr(data, 'dtype')):
    data = np.asanyarray(data)

# Determine whether the __init__ has completed.
@lbdreyer (Member) commented Mar 28, 2017:
I think this should be reworded to `Determine whether the object has been initialised`, or something like that.

"""
# Ensure there is a valid data state.
is_lazy = bool(self._lazy_array is not None)
is_real = bool(self._real_array is not None)
Reviewer (Member):

Why do these have to be bool? Surely self._lazy_array is not None already returns True/False

@bjlittle (Member, Author):

Yup, good spot ... this was originally something else that actually needed the bool.
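For the record, the redundancy is easy to demonstrate: an `is not None` identity test already evaluates to a bool, so wrapping it in bool() changes nothing:

```python
x = None
y = 42

print(type(x is not None))  # <class 'bool'>
print(x is not None)        # False
print((y is not None) == bool(y is not None))  # True: bool() is a no-op here
```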

def _assert_axioms(self):
    """
    Definition of the manager state, that should never be violated.

Reviewer (Member):

Is it worth stating the axioms in the doc string or is this repeating too much?

@bjlittle (Member, Author):

Personally, I think it's a bit much. It's a private method, so it's only there for the developer, and they can just read the code; I see the doc-string as giving some context to the dev, not a literal translation of what the method actually does.

    The number of dimensions covered by the data being managed.

    """
    return len(self.shape)
Reviewer (Member):

Why aren't you just calling ndim?

>>> d = da.from_array(np.arange(6), chunks=1)
>>> print(d.ndim)
1

@bjlittle (Member, Author):

Yup, got lazy and mirrored the Cube.ndim method ... sheesh
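For reference, NumPy and dask arrays both expose an ndim attribute, so delegating to the wrapped array sidesteps len(self.shape) entirely (the dask import is guarded in case it is unavailable):

```python
import numpy as np

real = np.arange(6).reshape(2, 3)
print(real.ndim)  # 2

try:
    import dask.array as da
except ImportError:
    da = None

if da is not None:
    lazy = da.from_array(real, chunks=1)
    print(lazy.ndim)  # 2: same attribute, lazy or real
```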

"""
return self._real_array is not None

def lazy_data(self):
Reviewer (Member):

When do you expect this to be used?

@bjlittle (Member, Author):

The owner of the DataManager will call this. The use case appears all over the dask branch with respect to the Cube.

Eventually, this will also be the case for coordinates and the like ...

Reviewer (Member):

Makes sense. I, admittedly, had up to this point only been thinking of this as an independent thing.

@bjlittle (Member, Author):

Yeah, the behaviour has really been driven by the owner, i.e. the usage pattern in the Cube, so I'm hoping that once this is merged, the act of integration will prove what we should (and should not) keep in the DataManager API.

size = np.cumprod(shape)[-1]
mask_array = ma.arange(size).reshape(shape)
dtype = mask_array.dtype
lazy_array = as_lazy_data(mask_array)
Reviewer (Member):

Why use as_lazy_data here but not in other places?

@bjlittle (Member, Author):

I've addressed this now.

mocker.return_value = return_value
result = copy.deepcopy(dm)
self.assertEqual(mocker.call_count, 1)
self.assertIs(result, return_value)
Reviewer (Member):

And possibly test that it is passing the memo dictionary?

@bjlittle (Member, Author):

I'll take a peek ...

@bjlittle (Member, Author):

Yeah, using copy.deepcopy populates the memo (somehow); that's why I dodged that bullet ... I'll see if I can work out what it's doing and extend the test ... seems do-able!

@bjlittle (Member, Author) commented Mar 29, 2017:

Hmmm, this has actually become somewhat painful ...

For this test copy.deepcopy populates the memo with the following example:

   (Pdb) list
556  	        return_value = mock.sentinel.return_value
557  	        with mock.patch(method) as mocker:
558  	            mocker.return_value = return_value
559  	            result = copy.deepcopy(dm)
560  	            import pdb; pdb.set_trace()
561  ->	            self.assertEqual(mocker.call_count, 1)
562  	        self.assertIs(result, return_value)
563  	
564  	
565  	class Test_copy(tests.IrisTest):
566  	    def test(self):
(Pdb) p mocker.call_args
call({139777040548944: sentinel.return_value, 139777048606992: [DataManager(array(0))]})

If we want to introspect at this level, then I'm pretty much forced to implement __eq__ and __ne__ comparison methods on the DataManager ... that might be nice to have, but I'm not really motivated to do so just for a test. It may be that there is a genuine use case, say in concatenate or merge or somewhere else, where it might be super handy, but I'm erring on the side of not doing it here and now ... thoughts @lbdreyer?

Reviewer (Member):

Interesting that we don't have tests for cube._deepcopy.
Yep, I'm happy for this to be left for some other time. We don't gain much from doing it, and I had suspected it would be quite painful, so in the name of diminishing returns I agree with putting this off.
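For anyone revisiting this later, a minimal illustration (the Probe class is purely hypothetical, not iris code) of how copy.deepcopy hands a memo dictionary to __deepcopy__, and why registering the copy in it matters:

```python
import copy


class Probe:
    """Records the memo dictionary that deepcopy passes in."""

    def __init__(self, payload):
        self.payload = payload
        self.seen_memo = None

    def __deepcopy__(self, memo):
        self.seen_memo = memo
        new = Probe(copy.deepcopy(self.payload, memo))
        # Registering the copy keeps shared/cyclic references consistent.
        memo[id(self)] = new
        return new


p = Probe([1, 2])
q = copy.deepcopy(p)
print(isinstance(p.seen_memo, dict))  # True: a memo was supplied
print(q.payload == p.payload, q.payload is not p.payload)  # True True
```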

class Test_replace(tests.IrisTest):
    def test_real_with_real(self):
        shape = (2, 3, 4)
        size = np.cumprod(shape)[-1]
Reviewer (Member):

You do reuse these, so you may as well have them in a:

def setUp(self):
    self.shape = (2, 3, 4)
    self.size = np.prod(self.shape)
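As a quick check, the two spellings agree; np.prod is simply the more direct way to compute the element count:

```python
import numpy as np

shape = (2, 3, 4)
print(np.cumprod(shape)[-1])  # 24
print(np.prod(shape))         # 24: same size, simpler spelling
```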

.. note::
    Any lazy data being managed will be realised.

"""
Reviewer (Member):

I think this doc-string needs rewording. Usually the Returns section would have the data types.
Perhaps:

Returns the real data. If any lazy data is being managed, it will be realised.

Returns:
    The real :class:`~numpy.ndarray` or :class:`numpy.ma.core.MaskedArray`.

@@ -0,0 +1,94 @@
_lazy_data.py
Reviewer (Member):

Presumably this text file will be removed at some point. Or at least turned into some documentation??

@bjlittle (Member, Author):

Yeah, I'll just nuke it

@bjlittle (Member, Author):

@lbdreyer Okay, I think that I've addressed all of your issues apart from the open questions:

  • do we need the has_real_data method?
  • should the method lazy_data be a property?

I still need to update the XXX doc-strings, but all the other material changes, I think, are pretty much there ...


if isinstance(other, type(self)):
    result = False
    is_lazy = self.has_lazy_data() == other.has_lazy_data()
Reviewer (Member):

Trouble with this naming is that it suggests that, if is_lazy is True, then both self and other have lazy data. But the following could be the case:

>>> print(self.has_lazy_data())
False
>>> print(other.has_lazy_data())
False
>>> is_lazy = self.has_lazy_data() == other.has_lazy_data()
>>> print(is_lazy)
True

What about is_lazy_check or both_lazy?

@bjlittle (Member, Author):

I'll do a name change ... I've almost run out of creative juice, so thanks for the suggestions 😉

@lbdreyer (Member):

The ordering of the class methods, properties etc. is different to the ordering of the tests.
In the class itself the order is:

  • __init__
  • methods with a double leading underscore, in alphabetical order
  • methods with a single leading underscore, in alphabetical order
  • the rest, in alphabetical order

Should the unit tests also follow this order?

@bjlittle (Member, Author):

@lbdreyer I think this is good to merge now ...

@pelson (Member) commented Mar 30, 2017:

I have no comment on the implementation, but I want to explore whether this is an abstraction too far.
The PR itself obviously only enables the data manager abstraction, rather than making use of the implementation within the Cube, so it isn't clear to us as reviewers how much benefit we are able to get out of this change. Personally, I'd rather reduce our layers of abstraction within Iris and start treating arrays as ducks.

Would you be able to put forward a separate PR that makes use of the DataManager so that we can see how much code it liberates?

@bjlittle (Member, Author):

@pelson That's the follow on PR ...

@pelson (Member) commented Mar 30, 2017:

@pelson That's the follow on PR ...

😄 - and I appreciate you separating the two PRs! But to add technical complexity without being able to measure that against the simplification it will bring means we are making decisions in the dark.

@pelson (Member) commented Mar 30, 2017:

Just realised this is against the dask branch. If the intention is for that branch to be reviewed before moving to master, ignore my question about technical complexity - that can be answered in the larger review.

    self._assert_axioms()

@property
def core_data(self):
Reviewer (Member):

As a note for our future selves, I still like the idea of calling this data. Then we can rename data to real_data and keep lazy_data as lazy_data. I think this is the most descriptive set of names for these variables.

But let's not do that in this PR! One thing at a time.

@bjlittle (Member, Author) commented Mar 30, 2017:

Agreed. We should create a ticket in the dask backlog really.

@bjlittle (Member, Author):

I've created a note in the iris-dask-backlog to cover this @dkillick

if not (hasattr(data, 'shape') and hasattr(data, 'dtype')):
    data = np.asanyarray(data)

# Determine whether the class instance has been created,
Reviewer (Member):

It feels like there should be a better way to check for instantiation than this...

@bjlittle (Member, Author):

Please, let me know ...

Reviewer (Member):

I wish I knew what it was...

Kwargs:

* realised_dtype:
    The intended dtype of the specified lazy data.
@lbdreyer (Member) commented Mar 30, 2017:

I think it might be worth stating that this should only be set for non-float data. Else someone may do:

DataManager(lazy_data, realised_dtype=np.float32)

which would raise:

Can only cast lazy data to an integer or boolean dtype, got float.

which is the error message from _realised_dtype_setter.

@bjlittle (Member, Author):

Done.
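A sketch of the kind of guard being discussed (check_realised_dtype is a hypothetical helper, not the real _realised_dtype_setter, and the message wording is approximate):

```python
import numpy as np


def check_realised_dtype(dtype):
    """Reject realised dtypes that are not integer or boolean."""
    dtype = np.dtype(dtype)
    if dtype.kind not in 'biu':  # bool, signed int, unsigned int
        raise ValueError('Can only cast lazy data to an integer or '
                         'boolean dtype, got {}.'.format(dtype))
    return dtype


print(check_realised_dtype(np.int16))  # int16

try:
    check_realised_dtype(np.float32)
except ValueError as err:
    print(err)
```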

@bjlittle bjlittle assigned bjlittle and unassigned bjlittle Mar 30, 2017
@bjlittle bjlittle changed the title [WIP] Dask data manager Dask data manager Mar 30, 2017
@bjlittle bjlittle moved this from In Progress to In Review in Iris-dask-fill-value-dtype Mar 30, 2017
@bjlittle (Member, Author):

@lbdreyer I've just rebased against upstream/dask ...

realised_dtype=np.dtype('int16'))
self.assertFalse(dm1 == dm2)

def test__non_DataManager_failure(self):
Reviewer (Member):

Why does this have 2 underscores, i.e. why is it not test_non_DataManager_failure?

@bjlittle (Member, Author):

It's the type of test, see the dev notes on testing name styles. Not fussed really.

@lbdreyer (Member):

Thanks @bjlittle and really well done for persevering and not losing your mind in the process 👏 👏 💯 😌

@lbdreyer lbdreyer merged commit 91eaef6 into SciTools:dask Mar 30, 2017
@bjlittle bjlittle moved this from In Review to Done in Iris-dask-fill-value-dtype Mar 30, 2017
@bjlittle bjlittle mentioned this pull request Apr 24, 2017
@bjlittle bjlittle deleted the dask-data-manager branch May 2, 2017 13:02
bjlittle added a commit to bjlittle/iris that referenced this pull request May 31, 2017
Add data manager with tests.
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017