Define AttributeDict #948

simonbyrne · 2022-06-04T21:57:52Z

This follows on from #943 (comment)

It defines attrs(obj), which returns an AttributeDict. This is mostly the same as the Attributes object, except:

it is a subtype of AbstractDict{String, Any}, and supports the relevant features (keys, pairs, values)
getindex will read instead of returning an Attribute
it supports delete!
it supports overwriting attributes
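The behaviour listed above can be sketched without HDF5.jl. The following is a minimal, hypothetical stand-in: ToyAttributeDict and its store field are illustrative names, not part of the package, and an in-memory Dict plays the role of the attributes stored on an HDF5 object. It is not the implementation in this PR.

```julia
# Hypothetical stand-in: `store` plays the role of the attributes attached
# to an HDF5 object. Nothing here touches a real file.
struct ToyAttributeDict <: AbstractDict{String,Any}
    store::Dict{String,Any}
end

# Minimal AbstractDict interface; keys, values and pairs then work via Base fallbacks.
Base.length(d::ToyAttributeDict) = length(d.store)
Base.iterate(d::ToyAttributeDict, state...) = iterate(d.store, state...)
Base.haskey(d::ToyAttributeDict, name::AbstractString) = haskey(d.store, name)
Base.get(d::ToyAttributeDict, name::AbstractString, default) = get(d.store, name, default)

# getindex returns the value itself rather than a handle to the attribute.
Base.getindex(d::ToyAttributeDict, name::AbstractString) = d.store[name]

# setindex! overwrites an existing attribute of the same name.
function Base.setindex!(d::ToyAttributeDict, val, name::AbstractString)
    d.store[String(name)] = val
    return d
end

# delete! removes the attribute and returns the dictionary.
function Base.delete!(d::ToyAttributeDict, name::AbstractString)
    delete!(d.store, name)
    return d
end

d = ToyAttributeDict(Dict{String,Any}("a" => 1))
d["a"] = [2, 3]      # overwrite with a value of a different type
d["b"] = "text"
delete!(d, "b")
```

Because the type subtypes AbstractDict{String,Any}, generic code such as Dict(d) or collect(keys(d)) should work on it unchanged.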

TODO:

update docs
deprecate attributes
deprecate getindex and setindex! on Datasets for writing attributes

mkitti · 2022-06-05T06:11:55Z

I'm not sure if we need to deprecate attributes. I think there might be some use in making that API type safe by parameterizing Attribute.

simonbyrne · 2022-06-05T06:19:21Z

I'm not sure if we need to deprecate attributes. I think there might be some use in making that API type safe by parameterizing Attribute.

It's kind of confusing to keep both, and everything it does can be better handled by the mid-level APIs.

mkitti · 2022-06-05T14:48:02Z

Even if we are to deprecate attributes, we still need to test that the old interface continues to work until we remove it.

For the moment, we need to duplicate the tests, not replace them. We can remove the tests when we actually remove the API.

mkitti

Overall I like the idea of having an AbstractDict{String, Any} interface emulating h5py. The interface is simple to use. However, we should also figure out a way to provide a Julian interface with additional type-safe features.

We should retain the tests for the current attributes API until removal. We cannot remove this without a deprecation period and at least a minor version bump.

I would also prefer to have deprecation in another PR while we evaluate the matter. It would also help to narrow the diff. We should consider if there might be a Julian way to emulate the C++ highfive interface with parametric types:
https://bluebrain.github.io/HighFive/class_high_five_1_1_attribute.html

One consideration is we could also specify the value type so that we can have an AttributeDict{T} <: AbstractDict{String, T} where T rather than the value being Any. Accessing attributes via this interface should use convert to coerce the type.

In summary, I think the easiest path here would be to focus on adding the new interface and make it optionally type safe.

src/api/helpers.jl

test/gc.jl

test/plain.jl

test/properties.jl

src/attributes.jl

mkitti · 2022-06-05T17:33:27Z

src/attributes.jl

@@ -188,45 +193,82 @@ function h5readattr(filename, name::AbstractString)
 end


-struct Attributes
- parent::Union{File,Object}
+struct AttributeDict <: AbstractDict{String,Any}


Let's consider parameterizing this so that we can have a more specific type than Any, which can be the default if not specified.

I don't see how that could be made to work, or really what advantage it would give. Can you expand on what you had in mind? What would happen if you attempted to read or write an attribute of a different type?

My intention was to provide a simple, easy-to-use interface. I think if you want to specify the return type, it would be better to do in the mid-level interface, e.g. adding a return type to read_attribute.

Perhaps AttributeDict{T} could use convert(T, ...)? If convert throws an error, then so be it.

It adds a lot of complexity and potential brittleness, and I'm still skeptical as to why it would be helpful. The files I've seen with attributes use a mix of different types (typically strings, integers and vectors of integers).

Here's my prototype diff:
JaneliaSciComp@c3a5db3
I added a try to Base.iterate to skip over items where convert fails.

Type stability can be helpful. Since I'm accessing the HDF5 file in Julia rather than Python, I would like to take advantage of the type system. In many cases, I may have prior knowledge of the type I'm expecting the attribute to be, or I can glean that from the API. This provides an easy way to enforce knowledge of the type, allowing subsequent code to be type stable without having to use a completely different API. It also throws an error if my assumptions are violated.

Having the parameter default to Any produces the exact same results you have now. Allowing for a parameter allows the same API to be extended to encourage type stability.

julia> attrs(h5f)
AttributeDict of HDF5.Group: / (file: test.h5) with 4 attributes:
  "bye" => "3.0"
  "hello" => 5
  "someint" => 3
  "test" => 5.5

julia> attrs(h5f, Int)
AttributeDict of HDF5.Group: / (file: test.h5) with 4 attributes:
  "hello" => 5
  "someint" => 3

julia> attrs(h5f, String)
AttributeDict of HDF5.Group: / (file: test.h5) with 4 attributes:
  "bye" => "3.0"

julia> attrs(h5f, Float64)
AttributeDict of HDF5.Group: / (file: test.h5) with 4 attributes:
  "hello" => 5.0
  "someint" => 3.0
  "test" => 5.5

julia> attrs(h5f, Int)["test"]
ERROR: InexactError: Int64(5.5)
Stacktrace:
 [1] Int64
   @ ./float.jl:812 [inlined]
 [2] convert(#unused#::Type{Int64}, x::Float64)
   @ Base ./number.jl:7
 [3] getindex(x::HDF5.AttributeDict{Int64}, name::String)
   @ HDF5 ~/.julia/dev/HDF5/src/attributes.jl:224
 [4] top-level scope
   @ REPL[12]:1

There are a few things to explore here, such as passing the parameter as the memtype to read_attribute, but I think this would allow this API to go a lot further. Since we are talking about parameterizing Dataset as well, I think adding a type parameter to AttributeDict would be consistent with future direction of this package.
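The convert-based idea discussed in this thread can be illustrated without HDF5.jl. This is only a rough sketch under stated assumptions: TypedAttrs and its store field are hypothetical names, an in-memory Dict stands in for the file, and the code is not the prototype diff linked above. As described there, getindex uses convert, and iterate skips entries whose conversion fails.

```julia
# Hypothetical sketch: `store` stands in for the attributes of an HDF5 object.
struct TypedAttrs{T} <: AbstractDict{String,T}
    store::Dict{String,Any}
end

# With no parameter given, fall back to Any (same results as an untyped dictionary).
TypedAttrs(store::Dict{String,Any}) = TypedAttrs{Any}(store)

# Reading coerces with convert; if convert throws, the error propagates.
Base.getindex(d::TypedAttrs{T}, name::AbstractString) where {T} =
    convert(T, d.store[name])::T

# Iteration skips over entries that cannot be converted to T.
function Base.iterate(d::TypedAttrs{T}, state...) where {T}
    next = iterate(d.store, state...)
    while next !== nothing
        (name, raw), st = next
        try
            return Pair{String,T}(name, convert(T, raw)), st
        catch
            # conversion failed: fall through and try the next entry
        end
        next = iterate(d.store, st)
    end
    return nothing
end

# Length counts only the entries that survive conversion.
function Base.length(d::TypedAttrs)
    n = 0
    for _ in d
        n += 1
    end
    return n
end

store = Dict{String,Any}("bye" => "3.0", "hello" => 5, "someint" => 3, "test" => 5.5)
ints = TypedAttrs{Int}(store)
floats = TypedAttrs{Float64}(store)
```

One consequence of skipping in iterate is that keys(ints) and ints["test"] disagree about whether "test" exists: the former omits it, while the latter throws an InexactError.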

Since we are talking about parameterizing Dataset as well, I think adding a type parameter to AttributeDict would be consistent with future direction of this package.

The analogy to Dataset would be adding a type parameter to Attribute though?

Dataset would be struct Dataset{T,N} <: AbstractArray{T,N} or struct Dataset{T,N} <: AbstractDiskArray{T,N}.

I wanted to explore if Attribute could become Attribute{T} and have a corresponding struct Attributes{T2} <: AbstractDict{String, Attribute{<:T2}}. However, you want to deprecate that interface, which is why I'm exploring the idea here.

I would be fine if we parameterized the struct but only implemented methods for AttributeDict{Any} in this pull request.

I still don't see the benefit of adding a parameter to AttributeDict: how would it help?

simonbyrne · 2022-06-06T05:38:28Z

I would also prefer to have deprecation in another PR while we evaluate the matter. It would also help to narrow the diff.

Okay, I can move the deprecations to a new PR.

We should consider if there might be a Julian way to emulate the C++ highfive interface with parametric types:
https://bluebrain.github.io/HighFive/class_high_five_1_1_attribute.html

I admit to not having much experience with C++, but this looks similar to the mid-level interface, combined with adding a type parameter or output array to read_attribute

mkitti · 2022-06-07T00:20:50Z

I admit to not having much experience with C++, but this looks similar to the mid-level interface, combined with adding a type parameter or output array to read_attribute

Here's an excerpt from their code: https://github.com/BlueBrain/Brion/blob/2b42ecd9c251741c42970141f34d292aee53d991/brion/plugin/compartmentReportHDF5.cpp#L78-L98

bool _verifyFile(const HighFive::File& file)
{
    try
    {
        uint32_t magic = 0;
        file.getAttribute("magic").read(magic);
        if (magic != _sonataMagic)
            return false;

        std::vector<uint32_t> version;
        file.getAttribute("version").read(version);
        if (version.size() != 2 || version[0] != _currentVersion[0] ||
            version[1] != _currentVersion[1])
            return false;
    }
    catch (HighFive::Exception& e)
    {
        return false;
    }
    return true;
}

simonbyrne · 2022-06-07T04:46:27Z

Thanks, that is helpful.

If we were to parametrise Attribute, would both magic and version be Attribute{UInt32}? or would version be Attribute{Vector{UInt32}}?

mkitti · 2022-06-07T04:56:23Z

Thanks, that is helpful.

If we were to parametrise Attribute, would both magic and version be Attribute{UInt32}? or would version be Attribute{Vector{UInt32}}?

julia> typeof(magic_attr)
Attribute{UInt32}

julia> typeof(version_attr)
Attribute{Vector{UInt32}}

version appears to be a Vector{UInt32} with a length of 2.

mkitti · 2022-06-07T04:58:16Z

The more I think about it, Attributes is really more like a NamedTuple or a struct than a Dict. We know the types of all of the attributes.

simonbyrne · 2022-06-07T05:04:37Z

The more I think about it, Attributes is really more like a NamedTuple or a struct than a Dict. We know the types of all of the attributes.

It's not static though: I can add or delete attributes

simonbyrne · 2022-06-07T05:06:42Z

version appears to be a Vector{UInt32} with a length of 2.

I meant what would the attribute type parameter be?

mkitti · 2022-06-07T05:38:16Z

Attribute{Vector{UInt32}} would correspond to "version". It's like a RefValue.
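To make the RefValue analogy concrete, here is a small sketch of what a parameterized attribute handle might look like. TypedAttribute is a hypothetical name, the Dict stands in for the parent HDF5 object, and the magic and version values are arbitrary placeholders; none of this is existing HDF5.jl API.

```julia
# Hypothetical handle: remembers where the attribute lives and which
# type T the caller expects to read from it.
struct TypedAttribute{T}
    store::Dict{String,Any}   # stand-in for the parent HDF5 object
    name::String
end

# `a[]` reads the value, mirroring how `r[]` dereferences a `Ref`.
Base.getindex(a::TypedAttribute{T}) where {T} = convert(T, a.store[a.name])::T

# Placeholder values, chosen only for illustration.
store = Dict{String,Any}("magic" => 0x0a7a, "version" => [0, 1])

magic_attr = TypedAttribute{UInt32}(store, "magic")
version_attr = TypedAttribute{Vector{UInt32}}(store, "version")

magic = magic_attr[]       # a UInt32
version = version_attr[]   # a Vector{UInt32}
```

Here the parameter is the full type of the value that is read, so a scalar and a vector of the same element type get different handle types.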

musm · 2022-06-07T16:43:21Z

It's a shame we can't use attributes for this.

musm · 2022-06-07T16:45:49Z

Could you move the deprecated methods to the src/deprecation.jl file?

mkitti · 2022-06-07T17:44:47Z

I asked him to do the deprecation in another pull request.

simonbyrne · 2022-06-07T20:36:08Z

It's a shame we can't use attributes for this.

Agreed, but once it's deprecated it becomes easier.

src/attributes.jl

mkitti · 2022-06-08T09:09:06Z

src/attributes.jl

@@ -178,8 +184,7 @@ function h5readattr(filename, name::AbstractString)
 file = h5open(filename,"r")
 try
 obj = file[name]
- a = attributes(obj)
- dat = Dict(x => read(a[x]) for x in keys(a))
+ dat = Dict(attrs(obj))


Suggested change

dat = Dict(attrs(obj))

dat = Dict(AttributeDict(obj))

Internally, we should just use the constructor.

Is there any difference?

Currently, calling the constructor directly has one less level of indirection. For there to be no difference, we would need const attrs = AttributeDict.

mkitti · 2022-06-08T09:11:14Z

test/attributes.jl

+ @test haskey(attrs(f), "a")
+ @test attrs(f)["a"] == 1
+
+ attrs(f)["b"] = [2,3]


Test attrs on groups and datasets as well.

Co-authored-by: Mark Kittisopikul <[email protected]>

mkitti · 2022-06-08T23:53:05Z

I've had the opportunity to dig into read and analyze type stability. In principle, if we specified the Type to be returned we should be able to achieve type stability.

However, there are fundamental issues down to the _generic_read function that I just refactored. In particular, we specify the memory type, but we do not specify whether the result will be an array or a scalar based on an argument type. Thus the type stability is problematic down to the core of the reading mechanism. Currently calling Base.read(dataset, T) could return either T or Array{T}. To change this behavior would involve a breaking change. It cannot be achieved by modifying the mid-level API alone.

The only non-breaking route to type stable reads that I can see is by adding new methods. In a sense, the new copyto! is type stable. If we implemented read!, I think we could also make that type stable. read! looks very much like the C++ API I posted above, which suggests that perhaps we should revisit implementing read!. In contrast, the best we could do without a significant overhaul at the moment is equivalent to adding a type assertion.
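The difference between the two styles can be sketched with toy functions. toy_read and toy_read! are hypothetical names and the Dict stands in for a file; they are not the HDF5.jl read or read! methods. The first returns either T or an Array{T} depending on what is stored, while the second fills a caller-supplied buffer whose type is known up front.

```julia
# Return type depends on the stored value: T for a scalar, Array{T} for an array.
function toy_read(store::Dict{String,Any}, name::AbstractString, ::Type{T}) where {T}
    raw = store[name]
    return raw isa AbstractArray ? collect(T, raw) : convert(T, raw)
end

# The caller supplies the destination, so the result type is fixed by `out`.
function toy_read!(out::AbstractArray, store::Dict{String,Any}, name::AbstractString)
    raw = store[name]
    size(raw) == size(out) || throw(DimensionMismatch("stored size does not match buffer"))
    copyto!(out, raw)   # converts each element to eltype(out)
    return out
end

# Placeholder values, chosen only for illustration.
store = Dict{String,Any}("magic" => 0x0a7a, "version" => [0, 1])

magic = toy_read(store, "magic", UInt32)       # scalar
version = toy_read(store, "version", UInt32)   # vector, from the same call signature

buf = Vector{UInt32}(undef, 2)
toy_read!(buf, store, "version")               # always fills and returns `buf`
```

A type assertion such as toy_read(store, "magic", UInt32)::UInt32 recovers a known type at the call site, which is the "equivalent to adding a type assertion" option mentioned above.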

That said, I believe it would be possible to eventually use the following syntax to do type stable reads of attributes. It would just need to be rebased upon a lower level API than read_attribute.

magic = AttributeDict{UInt32}(file)["magic"]
version = AttributeDict{Vector{UInt32}}(file)["version"]

The advantage of this syntax is that the second type parameter for AbstractDict is clearly the return type.

My conclusion at the moment is that it is probably possible to achieve type stability through this new access mechanism, but I'm sympathetic to the idea that this might be too much work to be achieved in this pull request. Scoping everything to AttributeDict{Any} with the current implementation would allow a revision to expand this mechanism to enhance type stability and safety.

The main thing on my mind at the moment is expanding test coverage for Group, Dataset, and Datatype.

musm · 2022-06-09T03:57:29Z

My conclusion is that we should proceed with the changes in this PR and revisit the type stability issue in other PRs. I don't think it's a high priority for the new interface. We should probably instead aim to get the new API in and then focus on stability later.

Personally, I'm not confident how much of a performance benefit stability could have, although I understand it's desirable in a lot of contexts.

However, there are fundamental issues down to the _generic_read function that I just refactored. In particular, we specify the memory type, but we do not specify whether the result will be an array or a scalar based on an argument type. Thus the type stability is problematic down to the core of the reading mechanism. Currently calling Base.read(dataset, T) could return either T or Array{T}. To change this behavior would involve a breaking change. It cannot be achieved by modifying the mid-level API alone.

I think we should look into this more seriously, even if it means a breaking change. Especially if it simplifies type-stability with the AttributeDict interface and other API methods. It sounds more robust than trying to monkey patch the AttributeDict interface for type stability, which doesn't sound entirely trivial with the current code.

mkitti · 2022-06-09T08:53:51Z

Within HDF5.jl, the performance impact is likely not great. However, the performance impact could be significant in downstream applications as this makes dynamic dispatch much more likely. Moreover, the issue is about correctness. Can people write scientific programs using HDF5.jl that will be robust to variations in HDF5 files? Furthermore, can we balance convenience and correctness?

The immediate question is if there is anything that we can do here that will make iteration on this design easier in the future?

1. Parameterize the type, but only implement the case for Any.
2. Consider if we want attrs to be a method or an alias for AttributeDict

const attrs = AttributeDict
AttributeDict(x) = AttributeDict{Any}(x)

magic = attrs(file)["magic"]
version = attrs(file)["version"]

Later we might then implement the following:

magic = attrs{UInt32}(file)["magic"]
version = attrs{Vector{UInt32}}(file)["version"]
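The alias pattern can be tried in isolation. In this sketch AliasDemoDict and alias_attrs are hypothetical names standing in for AttributeDict and attrs, and the struct holds only a string so that the example runs without HDF5.jl.

```julia
# Hypothetical parametric type; `parent` is just a label here.
struct AliasDemoDict{T}
    parent::String
end

# Unparameterized construction defaults to Any.
AliasDemoDict(parent) = AliasDemoDict{Any}(parent)

# An alias, unlike a function, can be given a type parameter with curly braces.
const alias_attrs = AliasDemoDict

untyped = alias_attrs("file.h5")
typed = alias_attrs{UInt32}("file.h5")
```

If alias_attrs were instead defined as an ordinary function, the curly-brace form alias_attrs{UInt32}(...) would not be available, which is the practical difference between the two options.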

simonbyrne · 2022-06-09T18:29:46Z

1. Parameterize the type, but only implement the case for Any.

2. Consider if we want attrs to be a method or an alias for AttributeDict

I think both of those changes could easily be made at a later date without causing breakages.

For the record, I am opposed to 1, as I fail to see the benefits from the added complexity (I think type-stable access would be better done via read_attribute), and am ambivalent about 2.

mkitti · 2022-06-09T19:01:41Z

Adding a type parameter after release could be breaking if someone directly accessed the type, so we should be clear that it is not part of the public API.

I do not see how qualifying the current implementation as AttributeDict{Any} adds significant complexity, but I'll propose this in a separate pull request.

Let's expand the testing to directly test groups, datasets, and datatypes, then merge.

mkitti · 2022-06-09T20:03:04Z

Oops, I pushed 875807a here instead of my own fork. Any objections to the expanded testing?

mkitti

Merge when ready

musm · 2022-06-09T21:10:22Z

src/show.jl

@@ -63,6 +63,16 @@ function Base.show(io::IO, attr::Attributes)
 print(io, "Attributes of ", attr.parent)
 end

+Base.show(io::IO, attrdict::AttributeDict) = summary(io, attrdict)


How does Julia end up printing the rest of the attributes ? Is that in some sort of fallback AbstractDict printing?

Yes, the REPL printing is this one:
https://github.com/JuliaLang/julia/blob/803f90db9195c5c72df90d8a424c7066f1a8f2ee/base/show.jl#L133

Ah it's the 3 argument show method. Thanks, I was wondering why @which was sending me to a seemingly unrelated call.
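The split between the two show methods can be reproduced with a toy type. ShowDemoAttrs is a hypothetical name and the summary text is made up for the example. Only the two-argument show is defined here; the three-argument text/plain method is the generic AbstractDict one from Base.

```julia
# Hypothetical dictionary type backed by an in-memory Dict.
struct ShowDemoAttrs <: AbstractDict{String,Any}
    store::Dict{String,Any}
end
Base.length(d::ShowDemoAttrs) = length(d.store)
Base.iterate(d::ShowDemoAttrs, state...) = iterate(d.store, state...)

# Custom one-line summary, reused by the two-argument show.
Base.summary(io::IO, d::ShowDemoAttrs) = print(io, "ShowDemoAttrs with ", length(d), " attributes")
Base.show(io::IO, d::ShowDemoAttrs) = summary(io, d)

d = ShowDemoAttrs(Dict{String,Any}("a" => 1))

two_arg = sprint(show, d)                         # what print/repr use
three_arg = sprint(show, MIME("text/plain"), d)   # what the REPL display uses
```

The three-argument result starts with the same summary and then lists the entries, which is why the REPL output shows the attributes even though the two-argument method prints only the summary.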

Define AttibuteDict

3dde05a

update docs and deprecate

fddc71a

simonbyrne force-pushed the sb/attributedict branch from 1952ff8 to fddc71a June 5, 2022 06:20

mkitti requested changes Jun 5, 2022

revert deprecation

4937d66

simonbyrne mentioned this pull request Jun 7, 2022

make iterators safe for error handling #950

Merged

simonbyrne changed the title ~~Define AttibuteDict~~ Define AttributeDict Jun 7, 2022

simonbyrne mentioned this pull request Jun 8, 2022

Read all attributes into a dictionary (or structure) #857

Closed

mkitti reviewed Jun 8, 2022

src/attributes.jl (outdated)

mkitti reviewed Jun 8, 2022

src/attributes.jl

mkitti reviewed Jun 8, 2022

simonbyrne and others added 3 commits June 8, 2022 14:08

Merge branch 'master' into sb/attributedict

f489b9b

Update src/attributes.jl

1c95dff

Co-authored-by: Mark Kittisopikul <[email protected]>

Merge branch 'master' into sb/attributedict

3f55e48

Test AttributeDict on Group, Dataset, and Datatype

875807a

mkitti approved these changes Jun 9, 2022

musm reviewed Jun 9, 2022

musm merged commit b1f36c1 into JuliaIO:master Jun 9, 2022
