
Dask data manager #2461

Merged
merged 14 commits into SciTools:dask from dask-data-manager on Mar 30, 2017

Conversation

@bjlittle (Member) commented Mar 28, 2017:

Introduces the concept of a generic data manager with a well-defined API and a strictly enforced contract.

The purpose and intent of the data manager is to abstract and contain the state and behaviour required to manage a generic Iris data payload, be it for a cube, coordinate points, coordinate bounds, cell measure payload, and so on.

This is a WIP at the moment ...

  • Complete XXX doc-strings
  • Unit test coverage for iris._data_manager.DataManager._deepcopy method
  • Remove lib/iris/manager.txt file
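As a rough sketch of the concept (a toy class, not the actual iris._data_manager.DataManager; detecting a lazy payload via compute() is an assumption for illustration), the manager holds exactly one of a real or lazy payload and enforces that contract:

```python
import numpy as np


class MiniDataManager:
    """Toy manager holding exactly one of a real or a lazy payload."""

    def __init__(self, data):
        self._real_array = None
        self._lazy_array = None
        # Treat anything exposing compute() as a lazy (dask-like) array;
        # everything else is realised immediately.
        if hasattr(data, 'compute'):
            self._lazy_array = data
        else:
            self._real_array = np.asanyarray(data)
        self._assert_axioms()

    def _assert_axioms(self):
        # Contract: exactly one payload form is set at any time.
        is_lazy = self._lazy_array is not None
        is_real = self._real_array is not None
        assert is_lazy ^ is_real, 'Unexpected data manager state.'

    def has_lazy_data(self):
        return self._lazy_array is not None


dm = MiniDataManager([1, 2, 3])
print(dm.has_lazy_data())  # False: a plain sequence is realised up front
```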

@bjlittle (Member, Author):

Closes #2386

@lbdreyer (Member) left a comment:

Absolute monster of a PR!!!

I've added some comments but this is looking very good!

I want to go over a few things tomorrow but I thought I'd submit this review in the meantime.

if not (hasattr(data, 'shape') and hasattr(data, 'dtype')):
    data = np.asanyarray(data)

# Determine whether the __init__ has completed.
@lbdreyer (Member) commented Mar 28, 2017:
I think this should be reworded to `Determine whether the object has been initialised`, or something like that.

"""
# Ensure there is a valid data state.
is_lazy = bool(self._lazy_array is not None)
is_real = bool(self._real_array is not None)
Reviewer (Member):

Why do these have to be bool? Surely self._lazy_array is not None already returns True/False

@bjlittle (Member, Author):

Yup, good spot ... this was originally something else that actually needed the bool.
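For the record, the redundancy is easy to demonstrate: an `is not None` identity test already evaluates to a bool, so wrapping it in bool() changes nothing:

```python
x = None
y = 42

print(type(x is not None))  # <class 'bool'>
print(x is not None)        # False
print((y is not None) == bool(y is not None))  # True: bool() is a no-op here
```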

def _assert_axioms(self):
    """
    Definition of the manager state, that should never be violated.

Reviewer (Member):

Is it worth stating the axioms in the doc string or is this repeating too much?

@bjlittle (Member, Author):

Personally, I think it's a bit much. It's a private method, so it's only there for the developer, and they can just read the code; I see the doc-string as giving some context to the dev, not a literal translation of what the method actually does.

    The number of dimensions covered by the data being managed.

    """
    return len(self.shape)
Reviewer (Member):

Why aren't you just calling ndim?

>>> d = da.from_array(np.arange(6), chunks=1)
>>> print(d.ndim)
1

@bjlittle (Member, Author):

Yup, got lazy and mirrored the Cube.ndim method ... sheesh
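For reference, NumPy and dask arrays both expose an ndim attribute, so delegating to the wrapped array sidesteps len(self.shape) entirely (the dask import is guarded in case it is unavailable):

```python
import numpy as np

real = np.arange(6).reshape(2, 3)
print(real.ndim)  # 2

try:
    import dask.array as da
except ImportError:
    da = None

if da is not None:
    lazy = da.from_array(real, chunks=1)
    print(lazy.ndim)  # 2: same attribute, lazy or real
```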

"""
return self._real_array is not None

def lazy_data(self):
Reviewer (Member):

When do you expect this to be used?

@bjlittle (Member, Author):

The owner of the DataManager will call this. The use case appears all over the dask branch with respect to the Cube.

Eventually, this will also be the case for coordinates and the like ...

Reviewer (Member):

Makes sense. I, admittedly, had up to this point only been thinking of this as an independent thing.

@bjlittle (Member, Author):

Yeah, the behaviour has really been driven by the owner, i.e. the usage pattern in the Cube, so I'm hoping that once this is merged, the act of integration will prove what we should (and should not) keep in the DataManager API.

size = np.cumprod(shape)[-1]
mask_array = ma.arange(size).reshape(shape)
dtype = mask_array.dtype
lazy_array = as_lazy_data(mask_array)
Reviewer (Member):

Why use as_lazy_data here but not in other places?

@bjlittle (Member, Author):

I've addressed this now.

mocker.return_value = return_value
result = copy.deepcopy(dm)
self.assertEqual(mocker.call_count, 1)
self.assertIs(result, return_value)
Reviewer (Member):

And possibly test that it is passing the memo dictionary?

@bjlittle (Member, Author):

I'll take a peek ...

@bjlittle (Member, Author):

Yeah, using copy.deepcopy populates the memo (somehow); that's why I dodged that bullet ... I'll see if I can work out what it's doing and extend the test ... seems do-able!

@bjlittle (Member, Author) commented Mar 29, 2017:

Hmmm, this has actually become somewhat painful ...

For this test copy.deepcopy populates the memo with the following example:

   (Pdb) list
556  	        return_value = mock.sentinel.return_value
557  	        with mock.patch(method) as mocker:
558  	            mocker.return_value = return_value
559  	            result = copy.deepcopy(dm)
560  	            import pdb; pdb.set_trace()
561  ->	            self.assertEqual(mocker.call_count, 1)
562  	        self.assertIs(result, return_value)
563  	
564  	
565  	class Test_copy(tests.IrisTest):
566  	    def test(self):
(Pdb) p mocker.call_args
call({139777040548944: sentinel.return_value, 139777048606992: [DataManager(array(0))]})

If we want to introspect at this level, then I'm pretty much forced to implement __eq__ and __ne__ comparison methods on the DataManager ... that might be nice to have, but I'm not really motivated to do so just for a test. It may be that there is a genuine use case, say in concatenate or merge or somewhere else, where it might be super handy, but I'm erring on the side of not doing it here and now ... thoughts @lbdreyer?

Reviewer (Member):

Interesting that we don't have tests for cube._deepcopy.
Yep, I'm happy for this to be left for some other time. We don't gain much from doing it, and I had suspected it would be quite painful, so in the name of diminishing returns I agree with putting this off.
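For anyone revisiting this later, a minimal illustration (the Probe class is purely hypothetical, not iris code) of how copy.deepcopy hands a memo dictionary to __deepcopy__, and why registering the copy in it matters:

```python
import copy


class Probe:
    """Records the memo dictionary that deepcopy passes in."""

    def __init__(self, payload):
        self.payload = payload
        self.seen_memo = None

    def __deepcopy__(self, memo):
        self.seen_memo = memo
        new = Probe(copy.deepcopy(self.payload, memo))
        # Registering the copy keeps shared/cyclic references consistent.
        memo[id(self)] = new
        return new


p = Probe([1, 2])
q = copy.deepcopy(p)
print(isinstance(p.seen_memo, dict))  # True: a memo was supplied
print(q.payload == p.payload, q.payload is not p.payload)  # True True
```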

class Test_replace(tests.IrisTest):
    def test_real_with_real(self):
        shape = (2, 3, 4)
        size = np.cumprod(shape)[-1]
Reviewer (Member):

You do reuse these, so you may as well have them in a:

def setUp(self):
    self.shape = (2, 3, 4)
    self.size = np.prod(self.shape)
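As a quick check, the two spellings agree; np.prod is simply the more direct way to compute the element count:

```python
import numpy as np

shape = (2, 3, 4)
print(np.cumprod(shape)[-1])  # 24
print(np.prod(shape))         # 24: same size, simpler spelling
```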

.. note::
    Any lazy data being managed will be realised.

"""
Reviewer (Member):

I think this doc-string needs rewording. Usually the Returns section would have the data types.
Perhaps:

Returns the real data. If any lazy data is being managed, it will be realised.

Returns:
    The real :class:`~numpy.ndarray` or :class:`numpy.ma.core.MaskedArray`.

@@ -0,0 +1,94 @@
_lazy_data.py
Reviewer (Member):

Presumably this text file will be removed at some point. Or at least turned into some documentation??

@bjlittle (Member, Author):

Yeah, I'll just nuke it

@bjlittle (Member, Author):

@lbdreyer Okay, I think that I've addressed all of your issues apart from the open questions:

  • do we need the has_real_data method?
  • should the method lazy_data be a property?

I still need to update the XXX doc-strings, but all the other material changes, I think, are pretty much there ...


if isinstance(other, type(self)):
    result = False
    is_lazy = self.has_lazy_data() == other.has_lazy_data()
Reviewer (Member):

Trouble with this naming is that it suggests that, if is_lazy is True, then both self and other have lazy data. But the following could be the case:

>>> print(self.has_lazy_data())
False
>>> print(other.has_lazy_data())
False
>>> is_lazy = self.has_lazy_data() == other.has_lazy_data()
>>> print(is_lazy)
True

What about is_lazy_check or both_lazy?

@bjlittle (Member, Author):

I'll do a name change ... I've almost run out of creative juice, so thanks for the suggestions 😉

@lbdreyer (Member):

The ordering of the class methods, properties etc. is different to the ordering of the tests.
In the class itself the order is:

  • __init__
  • methods with a double leading underscore, in alphabetical order
  • methods with a single leading underscore, in alphabetical order
  • the rest, in alphabetical order

Should the unit tests also follow this order?

@bjlittle (Member, Author):

@lbdreyer I think this is good to merge now ...

@pelson (Member) commented Mar 30, 2017:

I have no comment on the implementation, but I want to explore whether this is an abstraction too far.
The PR itself obviously only enables the data manager abstraction, rather than making use of the implementation within the Cube, so it isn't clear to us as reviewers how much benefit we are able to get out of this change. Personally, I'd rather reduce our layers of abstraction within Iris and start treating arrays as ducks.

Would you be able to put forward a separate PR that makes use of the DataManager so that we can see how much code it liberates?

@bjlittle (Member, Author):

@pelson That's the follow on PR ...

@pelson (Member) commented Mar 30, 2017:

@pelson That's the follow on PR ...

😄 - and I appreciate you separating the two PRs! But to add technical complexity without being able to measure that against the simplification it will bring means we are making decisions in the dark.

@pelson (Member) commented Mar 30, 2017:

Just realised this is against the dask branch. If the intention is for that branch to be reviewed before moving to master, ignore my question about technical complexity - that can be answered in the larger review.

    self._assert_axioms()

@property
def core_data(self):
Reviewer (Member):

As a note for our future selves, I still like the idea of calling this data. Then we can rename data to real_data and keep lazy_data as lazy_data. I think this is the most descriptive set of names for these variables.

But let's not do that in this PR! One thing at a time.

@bjlittle (Member, Author) commented Mar 30, 2017:

Agreed. We should create a ticket in the dask backlog really.

@bjlittle (Member, Author):

I've created a note in the iris-dask-backlog to cover this @dkillick

if not (hasattr(data, 'shape') and hasattr(data, 'dtype')):
    data = np.asanyarray(data)

# Determine whether the class instance has been created,
Reviewer (Member):

It feels like there should be a better way to check for instantiation than this...

@bjlittle (Member, Author):

Please, let me know ...

Reviewer (Member):

I wish I knew what it was...

Kwargs:

* realised_dtype:
    The intended dtype of the specified lazy data.
@lbdreyer (Member) commented Mar 30, 2017:

I think it might be worth stating that this should only be set for non-float data. Else someone may do:

DataManager(lazy_data, realised_dtype=np.float32)

which would raise:

Can only cast lazy data to an integer or boolean dtype, got float.

which is the error message from _realised_dtype_setter.

@bjlittle (Member, Author):

Done.
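A sketch of the kind of guard being discussed (check_realised_dtype is a hypothetical helper, not the real _realised_dtype_setter, and the message wording is approximate):

```python
import numpy as np


def check_realised_dtype(dtype):
    """Reject realised dtypes that are not integer or boolean."""
    dtype = np.dtype(dtype)
    if dtype.kind not in 'biu':  # bool, signed int, unsigned int
        raise ValueError('Can only cast lazy data to an integer or '
                         'boolean dtype, got {}.'.format(dtype))
    return dtype


print(check_realised_dtype(np.int16))  # int16

try:
    check_realised_dtype(np.float32)
except ValueError as err:
    print(err)
```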

@bjlittle bjlittle assigned bjlittle and unassigned bjlittle Mar 30, 2017
@bjlittle bjlittle changed the title [WIP] Dask data manager Dask data manager Mar 30, 2017
@bjlittle bjlittle moved this from In Progress to In Review in Iris-dask-fill-value-dtype Mar 30, 2017
@bjlittle (Member, Author):

@lbdreyer I've just rebased against upstream/dask ...

realised_dtype=np.dtype('int16'))
self.assertFalse(dm1 == dm2)

def test__non_DataManager_failure(self):
Reviewer (Member):

Why does this have 2 underscores, i.e. why is it not test_non_DataManager_failure?

@bjlittle (Member, Author):

It's the type of test, see the dev notes on testing name styles. Not fussed really.

@lbdreyer (Member):

Thanks @bjlittle and really well done for persevering and not losing your mind in the process 👏 👏 💯 😌

@lbdreyer lbdreyer merged commit 91eaef6 into SciTools:dask Mar 30, 2017
@bjlittle bjlittle moved this from In Review to Done in Iris-dask-fill-value-dtype Mar 30, 2017
@bjlittle bjlittle mentioned this pull request Apr 24, 2017
@bjlittle bjlittle deleted the dask-data-manager branch May 2, 2017 13:02
bjlittle added a commit to bjlittle/iris that referenced this pull request May 31, 2017
Add data manager with tests.
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017