Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make CDL tests independent of attribute string typing #2133

Closed
wants to merge 2 commits into from

Conversation

marqh
Copy link
Member

@marqh marqh commented Sep 7, 2016

Behaviour changes in netCDF4-python have triggered test failures for v1.1 and 1.2 of netcdf4-python

With these versions, some attributes are typed at string where they previously were not. There is ongoing work and discussion which may alter this behaviour again, somewhere in netcdf-c netcdf4-python and/or hdf5

see:
Unidata/netcdf-c#298
Unidata/netcdf4-python#529
Unidata/netcdf4-python#575
amongst others for further details.

My suggestion is an explicit work around through assertCDL, into _assert_str_same to allow outputs to be interpreted as the same, even though they are manifestly not

I'm not sure I like this approach, but I find myself suggesting it.

@DPeterK
Copy link
Member

DPeterK commented Sep 14, 2016

This is not a nice change to have to be making - but that, as @marqh's comment above shows, is not our fault, more something that we're having to react to.

Functionally this change looks to do what we need it to (ignore \t\tstring in CML output). Pragmatically we need something like this change to help us move forward, and I think it can be argued that the rest of the CDL is the same whether \t\tstring is present or not.

My opinion then is that we grit our teeth and merge this 😬 But that's just my opinion; @bjlittle @pelson it would be nice to have some consensus on this...

@marqh
Copy link
Member Author

marqh commented Sep 26, 2016

My opinion then is that we grit our teeth and merge this 😬 But that's just my opinion;

understood

@bjlittle @pelson it would be nice to have some consensus on this...

indeed

# https://github.com/Unidata/netcdf4-python/issues/575
# amongst others for further details.
if type_comparison_name == 'CDL' and reference_str != test_str:
test_str = test_str.replace('\t\tstring ', '\t\t')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marqh Matching with a replace in this way is kinda open to making incorrect changes.

Perhaps you could be more rigorous and use a compiled regexp that also anchors the pattern match to the beginning of a line with globbed white space ...

Copy link
Member

@bjlittle bjlittle Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marqh I really don't like the fact that we're burying this patch in the bowels of the testing framework, where it will be forgotten and perhaps sting us some time (long?) in the future.

Is there an opportunity here to elevate this change to be a registered iris patch? A proper thing (a first class iris citizen) that isn't hidden, but something that is very visible, easy to control, govern and perhaps at some point retire or deprecate.

Specifically, I'm thinking of a private iris patch file per module that contains all relevant patch functionality that we currently apply in that module. The application of such patching (to me) seems quite generic and specific to individual patching cases. But in this case I'm thinking of a specific patch decorator that we would apply to this specific method, where the patch decorator would do the regexp on the test_str, as required.

The specific patch decorator could even only be applied for netCDF v1.1+ and contain all relevant documented/PR details/history for other dev's in the future to understand the context of this specific patch ...

Thoughts? Or am I guilty of over thinking this ... I still believe it would be a simple change, but it's this kind of explicit patching strategy that is of more interest (to me) as we move forward

@marqh
Copy link
Member Author

marqh commented Oct 3, 2016

@bjlittle

I have updated the change to use a more careful regex/sub pattern.

@marqh I really don't like the fact that we're burying this patch in the bowels of the testing framework, where it will be forgotten and perhaps sting us some time (long?) in the future.

Is there an opportunity here to elevate this change to be a registered iris patch? A proper thing (a first class iris citizen) that isn't hidden, but something that is very visible, easy to control, govern and perhaps at some point retire or deprecate.

I don't think I agree at this point. This is a change to our testing to assert that two different files are actually the same, to work around instabilities in netcdf4-python versions.

It is arguable that the files are different and these tests should fail: that this change is wrong in principal. We can make that call.
This PR presents an alternative, that these files are different, but that that does not matter, so we should accept it.

So, I don't think this fits a patch strategy. If we don't think these files should test as the same, we need a more fundamental assessment of how to deal with the changes in netcdf4-python:
Unidata/netcdf-c#298
Unidata/netcdf4-python#529
Unidata/netcdf4-python#575

# tab|tab|string |attr_definition.
pattern = re.compile('(^\t\t)string (.+)$', re.MULTILINE)
# Replace with tab|tab|attr_definition.
replacement = r'\t\t\2'
Copy link
Member

@QuLogic QuLogic Oct 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A raw string means the tab isn't interpreted; how does it pass?

@QuLogic
Copy link
Member

QuLogic commented Oct 3, 2016

Can this be targeted at 1.10.x instead?

@pelson
Copy link
Member

pelson commented Oct 4, 2016

We have a number of options here:

  • we make iris behave consistently across multiple versions of netCDF4 python by forcing byte-string strings (NC_CHAR)
  • we accept the differences between versions heuristically (when testing), as per this PR
  • we enforce unicode strings (NC_STRING) when reading and writing from NetCDF
  • we introduce a heuristic to determine whether we save a string as NC_STRING or NC_CHAR

Enforcing NC_CHAR

Pros:

  • Simple to implement
  • This was the behaviour for ~ <=1.0.2 of netcdf4-python
  • All tests are already assuming this

Cons:

  • Unicode isn't supported without the user explicitly decoding the loaded byte-strings
  • Unpredictable type in loaded string depending on netcdf4-python heuristics (may be bytes/string, may be unicode)

Heuristic acceptance of NC_CHAR or NC_STRING when testing

Pros:

  • Simple to implement

Cons:

  • We will miss situations where we care about the NetCDF string type and could therefore easily miss an unexpected change of behaviour (though it is questionable whether we actually care).
  • Unpredictable behaviour for iris user across different netcdf4-python versions
  • Unpredictable behaviour for iris user when saving a cube which may or may not contain encodable ascii characters

Unicode strings throughout

Pros:

  • Predictable behaviour
  • Using unicode

Cons:

  • CDL may or may not have a string item for different netcdf4-python versions
  • Roundtrip will change strings to unicode
  • (Potential?) increased nc file size

Heuristic to save either NC_STRING or NC_CHAR

Pros:

  • Smaller file size than "Unicode strings throughout"
  • Repeatable behaviour, but somewhat arbitrarily string/unicode saving
  • Unicode is supported

Cons:

  • CDL may or may not have a string item for different netcdf4-python versions when saving attributes that can not be encoded as ASCII
  • Roundtrip will change unicode to string unless the unicode characters exist

I'll keep this comment up-to-date as more pros/cons/options emerge.

@pelson
Copy link
Member

pelson commented Oct 4, 2016

In order to put the above comment together in a coherent way, I was experimenting with our options and have put together #2158. I do not currently advocate that PR over this one, but would like to raise awareness of that as an option. While this PR represents a reduction in test certainty, PR #2158 represents a change in behaviour...

@marqh
Copy link
Member Author

marqh commented Oct 6, 2016

While this PR represents a reduction in test certainty, PR #2158 represents a change in behaviour...

I think we should be wary of a change in behaviour for all files, that would need to be justified

however, #2158 looks like all the test results are the same, so is this approach actually maintaining the same behaviour through netcdf4-python changes?

in #2158 none of the test results files are altered; apart from pep8 the tests pass

@pelson So, how is this a behaviour change, how does that manifest?

@marqh
Copy link
Member Author

marqh commented Oct 6, 2016

i think that #2158 represents a continuation of current Iris behaviour. I agree that it would be a behaviour reversion for people already using the latest netcdf4-python, but given the limited scope of the change, I think this represents a well constrained edge case

I think that #2158 is preferable to this PR and represents a sensible level of protection for Iris' netcdf output from netcdf4python changes, I suggest this PR is retired, unless #2158 proves problematic

@marqh
Copy link
Member Author

marqh commented Oct 6, 2016

Can this be targeted at 1.10.x instead?

i think that is a question for #2158, given my last comment

I will move the comment across to that ticket for further discussion. I am not convinced that it is required, but I have no particular objections. I want this in master and iris 1.11 which I currently feel needs to be imminent

@QuLogic
Copy link
Member

QuLogic commented Oct 10, 2016

#2158 replaced this one?

@pelson
Copy link
Member

pelson commented Oct 10, 2016

Thanks @QuLogic. It did indeed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants