Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define AttributeDict #948

Merged
merged 7 commits into from
Jun 9, 2022
Merged

Define AttributeDict #948

merged 7 commits into from
Jun 9, 2022

Conversation

simonbyrne
Copy link
Collaborator

@simonbyrne simonbyrne commented Jun 4, 2022

This follows on from #943 (comment)

It defines attrs(obj) which returns an AttibuteDict: this is mostly the same as Attributes object, except:

  • it is a subtype of AbstractDict{String, Any}, and supports the relevant features (keys, pairs, values)
  • getindex will read instead of returning an Attribute
  • it supports delete!
  • it supports overwriting attributes

TODO:

  • update docs
  • deprecate attributes
  • deprecate getindex and setindex! on Datasets for writing attributes

@mkitti
Copy link
Member

mkitti commented Jun 5, 2022

I'm not sure if we need to deprecate attributes. I think there might be some use in making that API type safe by parameterizing Attribute.

@simonbyrne
Copy link
Collaborator Author

I'm not sure if we need to deprecate attributes. I think there might be some use in making that API type safe by parameterizing Attribute.

It's kind of confusing to keep both, and everything it does can be better handled by the mid-level APIs.

@mkitti
Copy link
Member

mkitti commented Jun 5, 2022

Even if we are to deprecate attributes, we still need to test that the old interface continues to work until we remove it.

For the moment, we need to duplicate the tests, not replace them. We can remove the tests when we actually remove the API.

Copy link
Member

@mkitti mkitti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like the idea of having a AbstractDict{String, Any} interface emulating h5py. The interface is simple to use. However, we should also figure out a way to provide a Julian interface with additional type safe features.

We should retain the tests for the current attributes API until removal. We cannot remove this without a deprecation period and at least a minor version bump.

I would also prefer to have deprecation in another PR while we evaluate the matter. It would also help to narrow the diff. We should consider if there might be a Julian way to emulate the C++ highfive interface with parametric types:
https://bluebrain.github.io/HighFive/class_high_five_1_1_attribute.html

One consideration is we could also specify the value type so that we can have an AttributeDict{T} <: AbstractDict{String, T} where T rather than the value being Any. Accessing attributes via this interface should use convert to coerce the type.

In summary, I think the easiest path here would be to focus on adding the new interface and make it optionally type safe.

src/api/helpers.jl Outdated Show resolved Hide resolved
test/gc.jl Outdated Show resolved Hide resolved
test/plain.jl Outdated Show resolved Hide resolved
test/properties.jl Outdated Show resolved Hide resolved
src/attributes.jl Show resolved Hide resolved
@@ -188,45 +193,82 @@ function h5readattr(filename, name::AbstractString)
end


struct Attributes
parent::Union{File,Object}
struct AttributeDict <: AbstractDict{String,Any}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's consider parameterizing this so that we can have a more specific type than Any, which can be the default is not specified.

Copy link
Collaborator Author

@simonbyrne simonbyrne Jun 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how that could be made to work, or really what advantage it would give. Can you expand on what you had in mind? What would happen if you attempted to read or write an attribute of a different type?

My intention was to provide a simple, easy-to-use interface. I think if you want to specify the return type, it would be better to do in the mid-level interface, e.g. adding a return type to read_attribute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps AttributeDict{T} could use convert(T, ...)? If convert throws an error, then so be it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It adds a lot of complexity and potential brittleness, and I'm still skeptical as to why it would be helpful? The files I've seen with attributes use a mix of different types (typically strings, integers and vectors of integers).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my prototype diff:
JaneliaSciComp@c3a5db3
I added a try to Base.iterate to skip over items where convert fails.

Type stability can be helpful. Since I'm accessing the HDF5 file in Julia rather than Python, I would like to take advantage of the type system. In many cases, I may have prior knowledge of the type I'm expecting the attribute to be, or I can glean that from the API. This provides an easy way to enforce knowledge of the type, allowing subsequent code to be type stable without having to use a completely different API. It also throws an error if my assumptions are violated.

Having the parameter default to Any produces the exact same results you have now. Allowing for a parameter allows the same API to be extended to encourage type stability.

julia> attrs(h5f)
AttributeDict of HDF5.Group: / (file: test.h5) with 4 attributes:
  "bye"     => "3.0"
  "hello"   => 5
  "someint" => 3
  "test"    => 5.5

julia> attrs(h5f, Int)
AttributeDict of HDF5.Group: / (file: test.h5) with 4 attributes:
  "hello"   => 5
  "someint" => 3

julia> attrs(h5f, String)
AttributeDict of HDF5.Group: / (file: test.h5) with 4 attributes:
  "bye" => "3.0"

julia> attrs(h5f, Float64)
AttributeDict of HDF5.Group: / (file: test.h5) with 4 attributes:
  "hello"   => 5.0
  "someint" => 3.0
  "test"    => 5.5

julia> attrs(h5f, Int)["test"]
ERROR: InexactError: Int64(5.5)
Stacktrace:
 [1] Int64
   @ ./float.jl:812 [inlined]
 [2] convert(#unused#::Type{Int64}, x::Float64)
   @ Base ./number.jl:7
 [3] getindex(x::HDF5.AttributeDict{Int64}, name::String)
   @ HDF5 ~/.julia/dev/HDF5/src/attributes.jl:224
 [4] top-level scope
   @ REPL[12]:1

There a few things to explore here, such as passing the parameter as the memtype to read_attribute, but I think this would allow this API to go a lot further. Since we are talking about parameterizing Dataset as well, I think adding a type parameter to AttributeDict would be consistent with future direction of this package.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are talking about parameterizing Dataset as well, I think adding a type parameter to AttributeDict would be consistent with future direction of this package.

The analogy to Dataset would be adding a type parameter to Attribute though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataset would be struct Dataset{T,N} <: AbstractArray{T,N} or struct Dataset{T,N} <: AbstractDiskArray{T,N}.

I wanted to explore if Attribute could become Attribute{T} and have a corresponding struct Attributes{T2} <: AbstractDict{String, Attribute{<:T2}}. However, you want to deprecate that interface, which is why I'm exploring the idea here.

I would be fine if we parameterized the struct but only implemented methods for AttributeDict{Any} in this pull request.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see the benefit of adding a parameter to AttributeDict: how would it help?

@simonbyrne
Copy link
Collaborator Author

I would also prefer to have deprecation in another PR while we evaluate the matter. It would also help to narrow the diff.

Okay, I can move the deprecations to a new PR.

We should consider if there might be a Julian way to emulate the C++ highfive interface with parametric types:
https://bluebrain.github.io/HighFive/class_high_five_1_1_attribute.html

I admit to not having much experience with C++, but this looks similar to the mid-level interface, combined with adding a type parameter or output array to read_attribute

@mkitti
Copy link
Member

mkitti commented Jun 7, 2022

I admit to not having much experience with C++, but this looks similar to the mid-level interface, combined with adding a type parameter or output array to read_attribute

Here's an excerpt from their code: https://github.com/BlueBrain/Brion/blob/2b42ecd9c251741c42970141f34d292aee53d991/brion/plugin/compartmentReportHDF5.cpp#L78-L98

bool _verifyFile(const HighFive::File& file)
{
    try
    {
        uint32_t magic = 0;
        file.getAttribute("magic").read(magic);
        if (magic != _sonataMagic)
            return false;

        std::vector<uint32_t> version;
        file.getAttribute("version").read(version);
        if (version.size() != 2 || version[0] != _currentVersion[0] ||
            version[1] != _currentVersion[1])
            return false;
    }
    catch (HighFive::Exception& e)
    {
        return false;
    }
    return true;
}

@simonbyrne
Copy link
Collaborator Author

Thanks that is helpful.

If we were to parametrise Attribute, would both magic and version be Attribute{UInt32}? or would version be Attribute{Vector{UInt32}}?

@mkitti
Copy link
Member

mkitti commented Jun 7, 2022

Thanks that is helpful.

If we were to parametrise Attribute, would both magic and version be Attribute{UInt32}? or would version be Attribute{Vector{UInt32}}?

julia> typeof(magic_attr)
Attribute{UInt32}

julia> typeof(version_attr)
Attribute{Vector{UInt32}}

version appears to be a Vector{UInt32} of with a length of 2.

@mkitti
Copy link
Member

mkitti commented Jun 7, 2022

The more I think about it, Attributes is really more like a NamedTuple or a struct than a Dict. We know the types of all of the attributes.

@simonbyrne simonbyrne changed the title Define AttibuteDict Define AttributeDict Jun 7, 2022
@simonbyrne
Copy link
Collaborator Author

The more I think about it, Attributes is really more like a NamedTuple or a struct than a Dict. We know the types of all of the attributes.

It's not static though: I can add or delete attributes

@simonbyrne
Copy link
Collaborator Author

version appears to be a Vector{UInt32} of with a length of 2.

I meant what would the attribute type parameter be?

@mkitti
Copy link
Member

mkitti commented Jun 7, 2022

Attribute{Vector{UInt32}} would correspond to "version". It's like a RefValue.

@musm
Copy link
Member

musm commented Jun 7, 2022

It's a shame we can't use attributes for this.

@musm
Copy link
Member

musm commented Jun 7, 2022

Could you move the deprecated methods to the src/deprecation.jl file?

@mkitti
Copy link
Member

mkitti commented Jun 7, 2022

I asked him to do the deprecation in another pull request.

@simonbyrne
Copy link
Collaborator Author

It's a shame we can't use attributes for this.

Agreed, but once it's deprecated it become easier.

src/attributes.jl Outdated Show resolved Hide resolved
@@ -178,8 +184,7 @@ function h5readattr(filename, name::AbstractString)
file = h5open(filename,"r")
try
obj = file[name]
a = attributes(obj)
dat = Dict(x => read(a[x]) for x in keys(a))
dat = Dict(attrs(obj))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dat = Dict(attrs(obj))
dat = Dict(AttributeDict(obj))

Internally, we should just use the constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, there is one less level of indirection. For no difference we need const attrs = AttributeDict.

@test haskey(attrs(f), "a")
@test attrs(f)["a"] == 1

attrs(f)["b"] = [2,3]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test attrs on groups and datasets as well.

@mkitti
Copy link
Member

mkitti commented Jun 8, 2022

I've had the opportunity to dig into read and analyze type stability. In principle, if we specified the Type to be returned we should be able to achieve type stability.

However, there are fundamental issues down to the _generic_read function that I just refactored. In particular, we specify the memory type, but we do not specify whether the result will be an array or a scalar based on an argument type. Thus the type stability is problematic down to the core of the reading mechanism. Currently calling Base.read(dataset, T) could return either T or Array{T}. To change this behavior would involve a breaking change. It cannot be achieved by modifying the mid-level API alone.

The only non-breaking route to type stable reads that I can see is by adding new methods. In a sense, the new copyto! is type stable. If we implemented, read!, I think we could also make that type stable. read! looks very much like the C++ API I posted above, which suggests that perhaps we should revisit implementing read!. In contrast, the best we could do without a significant overhaul at the moment is equivalent to adding a type assertion.

That said, I believe it would be possible to eventually use the following syntax to do type stable reads of attributes. It would just need to be rebased upon a lower level API than read_attribute.

magic = AttributeDict{UInt32}(file)["magic"]
version = AttributeDict{Vector{UInt32}}(file)["version"]

The advantage of this syntax is that the second type parameter for AbstractDict is clearly the return type.

My conclusion at the moment is that it is probably possible to achieve type stability through this new access mechanism, but I'm sympathetic to the idea that this might be too much work to be achieved in this pull request. Scoping everything to AttributeDict{Any} with the current implementation would allow a revision to expand this mechanism to enhance type stability and safety.

The main thing on my mind at the moment is expanding test coverage for Group, Dataset, and Datatype.

@musm
Copy link
Member

musm commented Jun 9, 2022

My conclusion is that we should proceed with the changes in this PR and revisit the type stability issue in other PRs. I don't think it's a high priority for the new interface. We should probably instead aim to get the new API in and then focus on stability later.

Personally, I'm not confident how much of a performance benefit stability could have, although I understand it's desirable in a lot of contexts.

However, there are fundamental issues down to the _generic_read function that I just refactored. In particular, we specify the memory type, but we do not specify whether the result will be an array or a scalar based on an argument type. Thus the type stability is problematic down to the core of the reading mechanism. Currently calling Base.read(dataset, T) could return either T or Array{T}. To change this behavior would involve a breaking change. It cannot be achieved by modifying the mid-level API alone.

I think we should look into this more seriously, even if it means a breaking change. Especially if it simplifies type-stability with the AttributeDict interface and other API methods. It sounds more robust than trying to monkey patch the AttributeDict interface for type stability, which doesn't sound entirely trivial with the current code.

@mkitti
Copy link
Member

mkitti commented Jun 9, 2022

Within HDF5.jl, the performance impact is likely not great. However, the performance impact could be significant in downstream applications as this makes dynamic dispatch much more likely. Moreover, the issue is about correctness. Can people write scientific programs using HDF5.jl that will be robust to variations in HDF5 files? Furthermore, can we balance convenience and correctness?

The immediate question is if there is anything that we can do here that will make iteration on this design easier in the future?

  1. Parameterize the type, but only implement the case for Any.
  2. Consider if we want attrs to be a method or an alias for AttributeDict
const attrs = AttributeDict
AttributeDict(x) = AttributeDict{Any}(x)

magic = attrs(file)["magic"]
version = attrs(file)["version"]

Later we might then implement the following:

magic = attrs{UInt32}(file)["magic"]
version = attrs{Vector{UInt32}}(file)["version"]

@simonbyrne
Copy link
Collaborator Author

  1. Parameterize the type, but only implement the case for Any.
  2. Consider if we want attrs to be a method or an alias for AttributeDict

I think both of those changes could easily be made at a later date without causing breakages.

For the record, I am opposed to 1, as I fail to see the benefits from the added complexity (I think type-stable access would be better done via read_attribute), and am ambivalent about 2.

@mkitti
Copy link
Member

mkitti commented Jun 9, 2022

Adding a type parameter after release could be breaking if someone directly accessed the type, so we should be clear that it is not part of the public API.

I do not see how qualifying the current implementation as AttributeDict{Any} adds significant complexity, but I'll propose this in a separate pull request.

Let's expand the testing to directly test groups, datasets, and datatypes, then merge.

@mkitti
Copy link
Member

mkitti commented Jun 9, 2022

Oops, I pushed 875807a here instead of my own fork. Any objections to the expanded testing?

Copy link
Member

@mkitti mkitti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge when ready

@@ -63,6 +63,16 @@ function Base.show(io::IO, attr::Attributes)
print(io, "Attributes of ", attr.parent)
end

Base.show(io::IO, attrdict::AttributeDict) = summary(io, attrdict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does Julia end up printing the rest of the attributes ? Is that in some sort of fallback AbstractDict printing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it's the 3 argument show method. Thanks, I was wondering why @which was sending me to a seemingly unrelated call.

@musm musm merged commit b1f36c1 into JuliaIO:master Jun 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants