
[ABI] ABI Update For adding Version to DLPack #113

Merged (4 commits into dmlc:main, Jan 24, 2023)
Conversation

@tqchen (Member) commented Nov 9, 2022

This PR adds DLManagedTensorVersioned to DLPack

  • Add a new data structure DLManagedTensorVersioned, which contains a new version field.
  • Add a DLPackVersion struct to indicate both API and ABI versions.
  • Add a flags field for boolean options, with support for a read-only bit mask.

The change remains ABI compatible, as the new ABI surface is confined to DLManagedTensorVersioned. The data API, however, would involve an ABI update, with all DLPack importers/exporters updated to make use of DLManagedTensorVersioned.
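For reference, the proposed layout can be sketched as follows. This is a minimal sketch, not the actual header: DLTensor is replaced by a stand-in, and the exact field names and version-field contents in dlpack.h may differ from what is shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the DLTensor defined in dlpack.h (hypothetical). */
typedef struct { void *data; int32_t ndim; } DemoDLTensor;

/* Carries both API and ABI version numbers, as proposed. */
typedef struct {
  uint32_t api;
  uint32_t abi;
} DLPackVersion;

/* The versioned managed tensor: version first, then the
 * manager/deleter pair, then flags, then the tensor itself. */
typedef struct DLManagedTensorVersioned {
  DLPackVersion version;
  void *manager_ctx;
  void (*deleter)(struct DLManagedTensorVersioned *self);
  uint64_t flags;
  DemoDLTensor dl_tensor;
} DLManagedTensorVersioned;

/* Because version is the first field, a consumer can always read it
 * before deciding whether it understands the rest of the struct. */
int version_is_first_field(void) {
  return offsetof(DLManagedTensorVersioned, version) == 0;
}
```

Placing the version field first is what makes the rest of the discussion below (what a consumer may touch when versions disagree) possible at all.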

@tqchen (Member, Author) commented Nov 9, 2022

notice thread #110

@tqchen (Member, Author) commented Nov 9, 2022

Ready for review. This PR will remain open for at least one week.

@seberg (Contributor) left a comment:

Thanks, just some small comments/questions (most you can also ignore).

I want to think a bit more on it, but the main thing I am still pondering is the (Python) exchange protocol, which would be an addition to the changes here.

```c
 * This is a helper flag to help dependent packages to check possible extensions.
 *
 * The dependent package should be static_assert(kExtendedFlag > kDLDeviceTypeEnd);
 * to ensure that version update do not override existing flags.
```
Contributor:

I don't follow. Do dependent packages sometimes store more things in the device?

Member Author:

Sometimes a dependent package may store extended device types that are not yet part of DLPack (which can be included later), so this is just providing convenience for such packages.

Contributor:

Hmmmm, but that seems tricky in a dynamically linked world? I cannot do static-asserts otherwise and would have to trust a version check.

But in that case, getting an older version, I would have to check that the device actually makes sense in the old version, even though I compile against the new version!

Can we say that "extended" versions MUST NOT be shared in a dynamically linked world like Python? Or alternatively, explicitly block off an "extended" area?

Reply:

> sometimes a dependent package may store extended device types that are not yet part of DLPack

Would it not be better to set kDLDeviceTypeEnd to a really large integer to ensure that there are guaranteed to be no clashes? Like a reserved range for vendor-specific extensions that haven't been standardized yet.

Contributor:

I would suggest sacrificing the top bit for this purpose and making that explicit.
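The top-bit idea could look something like the following. This is purely hypothetical: the names below are not part of dlpack.h, and as noted just after, the idea was ultimately dropped from this PR.

```c
#include <stdint.h>

/* Hypothetical: reserve the most significant bit of the device type
 * for vendor-specific extensions that are not (yet) standardized. */
#define DEMO_DL_EXT_DEVICE_BIT (1U << 31)

/* A consumer can then reject or special-case extension devices
 * without risking a clash with future standardized enum values. */
int is_extension_device(uint32_t device_type) {
  return (device_type & DEMO_DL_EXT_DEVICE_BIT) != 0;
}
```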

Member Author:

I realize that there are different opinions on this one, and it is not necessarily closely tied to this proposal update. As a result, I will remove it for now.


```diff
 /*! \brief The current ABI version of dlpack */
-#define DLPACK_ABI_VERSION 1
+#define DLPACK_ABI_VERSION 2
```
Contributor:

Technically, this doesn't even break the ABI, does it (since only new symbols were added)?

I am still thinking a bit about how we should go about this. I.e. we could make the main DLManagedTensorVersioned struct always be the newest one and when we update, add a DLManagedTensorV1 (or add it now).
The main point being: When a new version is introduced, we need to keep the old one around so NumPy, etc. can maintain compatibility with older libraries more easily.

Member Author:

That makes sense for a future update. On the other hand, keeping many versions in a single file can be prohibitive, so ideally we want to rotate among only two versions. We can do so after another ABI update.

This is indeed not an ABI-breaking change; on the other hand, it does indicate a change of ABI in the data API exchange (of what is contained in the PyCapsule), hence the update.

Contributor:

That is fine... You could also start backward compatibility by having headers like dlpack_v1.h. They won't matter too much if they pile up a bit, and we can decide how long we want to keep old stuff around.

```c
typedef enum {
#endif
  /*! \brief bit mask to indicate that the tensor is read only. */
  kDLFlagBitMaskReadOnly = 1UL,
```
Contributor:

I like the habit of writing 1UL << 0 for flags, but it doesn't matter for one flag :).

What would be interesting is to note whether we want any other flags already at this point!
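As a sketch of how such a flag is consumed: the macro form shown here matches the direction taken later in this thread (the bitmask was eventually changed from an enum to a macro), though the exact spelling in dlpack.h may differ.

```c
#include <stdint.h>

/* Read-only bit as a macro rather than an enum, sidestepping the
 * difference in how C and C++ treat 64-bit enum values. */
#define DLPACK_FLAG_BITMASK_READ_ONLY (1UL << 0UL)

/* A consumer checks the bit before handing out mutable views. */
int tensor_is_read_only(uint64_t flags) {
  return (flags & DLPACK_FLAG_BITMASK_READ_ONLY) != 0;
}
```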

Member Author:

Updated

```c
 *
 * \sa DLPackFlagBitMask
 */
uint64_t flags;
```
Contributor:

It would be good to note somewhere that we guarantee the fields remain ABI stable up to here (I guess including the flags). That is because consumers must be able to call the destructor even if they don't understand the version.

But after this, everything may be version dependent (it may make sense to specify whether growing the struct at the end is considered valid without bumping the ABI, in theory).
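A sketch of the consumer-side guarantee being discussed: everything up to and including the deleter stays fixed, so an unsupported tensor can always be released safely. The struct definitions below are minimal stand-ins (using the major/minor scheme adopted later in this thread), and MY_SUPPORTED_MAJOR is an illustrative constant, not a real dlpack.h symbol.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins; the real definitions live in dlpack.h. */
typedef struct { uint32_t major; uint32_t minor; } DLPackVersion;
typedef struct DLManagedTensorVersioned {
  DLPackVersion version;
  void *manager_ctx;
  void (*deleter)(struct DLManagedTensorVersioned *self);
  /* Fields after this point may differ between ABI versions. */
  uint64_t flags;
} DLManagedTensorVersioned;

#define MY_SUPPORTED_MAJOR 1 /* version this consumer was built for */

/* Returns 1 if the tensor can be used, 0 if it was released unused.
 * Calling the deleter is the only safe operation on a tensor whose
 * major version the consumer does not understand. */
int consume_or_release(DLManagedTensorVersioned *t) {
  if (t->version.major != MY_SUPPORTED_MAJOR) {
    if (t->deleter) t->deleter(t);
    return 0;
  }
  /* ... safe to interpret the remaining fields here ... */
  return 1;
}

/* Demo: a tensor from a "future" major version is released, not used. */
static int released;
static void demo_deleter(DLManagedTensorVersioned *self) {
  (void)self;
  released = 1;
}
int demo_future_tensor(void) {
  DLManagedTensorVersioned t = {
    {MY_SUPPORTED_MAJOR + 1, 0}, NULL, demo_deleter, 0
  };
  return consume_or_release(&t) == 0 && released == 1;
}
```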

Member Author:

I agree

@wjakob commented Nov 9, 2022:

Yes, this is the key part, and it would be useful to summarize the high-level idea in this header since it's the most important file of the entire repository.

If we implement DLPack version X and get a tensor of version Y (where Y>X), what are we allowed to do? What are we not allowed to do?

@wjakob commented Nov 9, 2022:

There are also two version numbers: the DLPack version and the ABI version. Is the first one completely irrelevant in deciding what is valid/disallowed, or does it also play a potential role? If yes, how is it different from the ABI version? If no, why does it need to be part of the ABI via a field stored in data structures?

Contributor:

It could be major.minor, where minor allows e.g. additional (compatible!) flags or additional devices/dtypes (not strictly compatible, but a mismatch clearly can't fail silently). Not sure what the current scheme is?

Member Author:

Added some clarification of the current scheme

```c
 * if there is no way for the caller to provide a reasonable destructor.
 * The destructors deletes the argument self as well.
/*!
 * \brief Destructor - this should be called
```
@wjakob commented Nov 9, 2022:

nit picking: formatting is a little messed up (double spaces, rows of different length).

Edit: That also applies to a few of the other newly added comments, but I will just mention it once here to avoid spamming the PR.

Also: I think the idea of the doxygen \brief tag is that the description is actually brief (1 sentence) followed by a newline to separate it from a more thorough comment. The \brief part is used as a preview.

Member Author:

Great point, fixed for this particular case. We can do further cleanups.

@wjakob commented Nov 13, 2022

I feel like the interface could still be more explicit about what I would argue is the most important point about a versioned ABI: what happens when a program is given a DLPack tensor by some other tool, and the DLPackVersion::abi value of this tensor is higher than what the current program was compiled with?

With the breakneck speed of advances happening in ML and array programming, it is not inconceivable that the DLManagedTensorVersioned data structure will need to be extended in the future, requiring another bump to the DLPACK_ABI_VERSION value. Here are just a few examples off the top of my head; I am sure we could brainstorm further and more convincing use cases if needed:

  • In future complex applications involving accelerators, the device ID may not be descriptive enough. For example, a single device might be associated with multiple execution contexts, which would then need to be specified. Actually, this is already possible with CUDA today, and such tensors cannot be safely exchanged because they live in separate address spaces.

  • As ML models become larger and more distributed, DLPack might need to be extended to refer to tensors on network-attached accelerators.

  • Experimentation with tensor data types (e.g. beyond BFloat16) might lead to a situation where more context is needed to explain the actual data type of a tensor (i.e., in a way that doesn't neatly fit into the existing enumeration).

If such extensions are added, they would likely be irrelevant for most users except for a small sub-community that requires them. And that's basically the point I want to raise: it would be a shame if simply adding an extra field required another ABI break.

The current PR contains the message

\note Future ABI changes should keep everything until this field stable, to ensure that deleter can be correctly called.

(and then the actual things of interest follow, specifically flags and dl_tensor)

I find this quite conservative. Reading between the lines, to me this statement suggests that any disagreement in the DLPackVersion::abi field will need to break interoperability between frameworks exchanging DLPack tensors. The only thing one is allowed to do at that point is give up and call the deleter.

In practice, it would be perfectly fine to add fields to the data structure without breaking interoperability with a lower-versioned consumer, especially when those extra fields encode nuanced attributes that are perhaps only relevant to very specific use cases.

ABI breaks are extremely disruptive to the community, especially in the Python ecosystem where everyone ships separately pre-compiled packages via PyPI. Even for this proposal, I suspect that it will take a long time for the community to adopt versioned DLTensors. Therefore, this is also a unique chance to lower the cost of future changes, and I feel like the current PR misses that opportunity.

@wjakob commented Nov 13, 2022

To turn the above complaint into something more constructive, let me propose a three-part (MAJOR.MINOR.PATCH) DLPackVersion versioning scheme with the following interpretation:

  1. A change in a DLPack tensor's PATCH level indicates minor changes to this header file on the producer's side, such as the addition of a new enumeration entry (a new device, a new tensor data type, etc.).

    Example: Suppose a new BFLOAT24 tensor type is added. It is legal for a DLPack consumer to read fields from a future PATCH level subject to the constraint that it can interpret the enumeration values it sees. If the consumer does not know how to interpret the BFLOAT24 type, it should give up and call the deleter.

  2. A bump in the MINOR field indicates that a field was added at the end of the DLPackVersioned data structure (in addition to potential enumeration additions).

    Example: Suppose that a new device ID is added (kDLNetworkedGPU) which refers to a network-capable GPU. Accessing data there requires an IP address and port field (lots of hand waving here, it's just an example.. :-)). As before, it is perfectly fine for a DLPack consumer to read fields from a tensor with a MINOR version from its relative future subject to the constraint that it can interpret the enumeration values it sees. In this example, it would be safe for the consumer to access the tensor as long as it doesn't have the the newly added kDLNetworkedGPU type. It would not access the extra field(s) that it is in any case not aware of.

  3. A bump in the MAJOR field indicates some major restructuring. It is not safe to use a tensor if a major version disagreement is detected.

    Example: the latest transformer model GPT-10 has now become so big that the DLTensor must be upgraded from int64_t *strides to int128_t *strides. That messes up the whole data layout, and a full ABI break is necessary. Even in this case, there is a guarantee, which is that the tensor data structure starts with the fields

```c
DLPackVersion version;
void *manager_ctx;
void (*deleter)(struct DLManagedTensorVersioned *self);
```

    This means that a consumer can at least call the deleter, but that's it.

@tqchen (Member, Author) commented Nov 13, 2022

Thanks @wjakob for the proposal. I agree that we want to be careful with the changes, and it is great to have these ideas discussed before we commit to them.

Would love to see what others think as well

@leofang (Contributor) commented Nov 26, 2022

This is an interesting discussion, but I would argue it is not of concern to the Python Data Consortium. Basically, all we need is a safe way to continue evolving the existing DLPack protocol for describing a span of tensor data residing on a single device (important to your second point below). IIRC there's no discussion on evolving the protocol itself, as so far everyone is mostly happy with the convenience brought by DLPack. There are plenty of prior discussions scattered across the open issues of this repo that are finally converging in this PR.

> In future complex applications involving accelerators, the device ID may not be descriptive enough. For example, a single device might be associated with multiple execution contexts, which would then need to be specified. Actually, this is already possible with CUDA today and such tensors cannot be safely exchanged because they live in separate address spaces.

The fact that such a possibility already exists but has never generated any change request to DLPack speaks for itself. In the case of multiple CUDA contexts co-residing on the same device, it is generally hard to exchange properly without some common understanding (ex: both sides must use the same CUDA context, which in practice almost always refers to the CUDA runtime context). It is even harder to take into account the API/ABI changes needed to accommodate similar situations in other vendors' programming models. If NV and Intel both consider this appropriate (I do not speak for either on this), I'd argue let's not overcomplicate the story 🙂 Flexibility is good, but less so if we cannot make any progress.

> As ML models become larger and more distributed, DLPack might need to be extended to refer to tensors on network-attached accelerators.

As mentioned above, DLPack has been designed as a single-device exchange protocol. This use case is exactly why array libraries like Dask and cuNumeric cannot support DLPack, CUDA Array Interface, or any similar protocol. Distributed protocols have been discussed extensively (ex: here), but the fact that it went nowhere is a big issue (no distributed library provider can find common ground) that should not block this PR.

> Experimentation with tensor data types (e.g. beyond BFloat16) might lead to a situation where more context is needed to explain the actual data type of a tensor (i.e., in a way that doesn't neatly fit into the existing enumeration).

An actual data type here is represented by DLDataType, which contains a type-code enum, bits, and lanes. If a tensor data type cannot be represented by these fields, it is arguably too complicated to exchange through DLPack. But maybe I am biased on this.
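To illustrate the point, the three DLDataType fields already cover cases like a bfloat16 scalar or a vectorized float32x4 without any protocol change. The definitions below mirror dlpack.h, but treat the exact enum numbering as an assumption of this sketch.

```c
#include <stdint.h>

/* Mirrors dlpack.h: a type-code enum plus bit width and lane count. */
typedef enum { kDLInt = 0U, kDLUInt = 1U, kDLFloat = 2U,
               kDLBfloat = 4U } DLDataTypeCode;
typedef struct { uint8_t code; uint8_t bits; uint16_t lanes; } DLDataType;

/* bfloat16: code = kDLBfloat, 16 bits, 1 lane. */
DLDataType make_bf16(void) {
  DLDataType t = { (uint8_t)kDLBfloat, 16, 1 };
  return t;
}

/* A vectorized float32x4 is expressible with the same three fields. */
DLDataType make_float32x4(void) {
  DLDataType t = { (uint8_t)kDLFloat, 32, 4 };
  return t;
}
```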

@seberg (Contributor) commented Nov 26, 2022

My gut feeling is that squashing "minor" and "patch" is fine (but I don't care either way).

There are two main groups of changes:

  • Breaking: Don't even try reading the data in this struct! (Beyond version and cleanup)
  • Other: May be incompatible, but you will notice as you go along (a new device can break you if it is unified memory and you are a CPU library), though it is clearly an "unsupported device". The version might be useful, but just to include it in an error to the user: "maybe a version problem?"

From the Python side I am not scared about either kind of change at all. But I ask for a version request (by the consumer). If they say version="1.3" and the exporter supports version="1.5", which added new devices (i.e. unified memory rather than just CPU), the exporter can decide not to use those new devices.
Note that it would probably be OK to export with "1.5" as the version for simplicity.

We don't have to discuss the details about what might happen, a true ABI break should be assumed to happen eventually. It doesn't matter much which scenario causes it :).

@leofang leofang mentioned this pull request Dec 4, 2022
@seberg (Contributor) commented Dec 12, 2022

Should this be pushed forward soon? It seems to me that the versioning may be the only small open discussion here.
Since further changes to the format should only affect the included tensor, there seems to be not much else?

I am just hoping it might unblock further steps. I am trying to think about the exchange protocol more, but unless we change API more there is probably not much to do. For Python, I would update the capsule name and pass a maximum supported version (so that the exporter can produce older versions if they wish to).

It isn't that I can't think of API breaks to consider, but it seems OK to push this in any case.

@tqchen (Member, Author) commented Dec 12, 2022

Yes, sorry, I was a bit slow near the end of the year.

Summarizing the suggestions on versioning here. How about we do the following:

```c
struct DLPackVersion {
    // A number that indicates an ABI break.
    int32_t major;
    // A number that indicates an API change that remains backward compatible.
    int32_t minor;
};
```

This would mark a change in the versioning scheme, which is probably fine given DLPack is pre-1.0. We can start with a 1.0 release with this new ABI change.

@leofang (Contributor) commented Dec 22, 2022

So the proposed DLPackVersion would make DLPack semantically versioned. I like it too.

@tqchen (Member, Author) commented Jan 4, 2023

Happy new year. It seems we are all good with the major/minor version change. @wjakob @leofang @seberg please take another look and approve explicitly if you can.

@tqchen tqchen force-pushed the main branch 2 times, most recently from 80669fa to e86f014 Compare January 4, 2023 15:41
@seberg (Contributor) left a comment:

Thanks, this seems like a good next step to me.

@wjakob left a comment:

I am sad to see minor and patch being squashed, but it seems I am in the minority. I have added a small request to include more information from the discussion here into the header file.

By the way, one off-topic point: the Python tensor.__dlpack__(version=..) interface will need some careful thinking to make sure it doesn't end up being a pain. For example, a version number in string form, as shown by @seberg above, would require everyone to build custom version number parsers.

What's worse is that there is no good way to test if a Python function actually accepts a version argument other than calling it to see if it raises an exception (which is super-slow especially when combined with C++ bindings that need to catch and re-throw exceptions in different languages). Existing libraries that don't support versioned tensors (and won't, for a while) will fall into this category.

So... I renew my request for a different API function (e.g. __vdlpack__(version: Optional[(int, int)]) for versioned DLPack). Anyway, probably a discussion for another PR, but I don't think this part will be as easy as it may seem.

```c
 * data layout of the ABI - DLManagedTensorVersioned.
 *
 * A change in minor version indicates that we have added new
 * code, such as a new device type, but the ABI is kept the same.
```
Reply:

I would add something of the following sort just to spell it out clearly somewhere:

If an obtained DLPack tensor has a major version that disagrees with the version number specified in this header file (i.e. major != DLPACK_MAJOR_VERSION), the consumer must call the deleter. It is not safe to access any other fields as the memory layout will have changed. In the case of a minor version mismatch, the tensor can be safely used as long as the consumer knows how to interpret all fields. Minor version updates indicate the addition of enumeration values.
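The rule above can be sketched as a consumer-side check. The DLPACK_MAJOR_VERSION / DLPACK_MINOR_VERSION values here are illustrative stand-ins for whatever the consumer's copy of dlpack.h defines.

```c
#include <stdint.h>

#define DLPACK_MAJOR_VERSION 1 /* illustrative: the version of the */
#define DLPACK_MINOR_VERSION 0 /* header this consumer built with  */

typedef struct { uint32_t major; uint32_t minor; } DLPackVersion;

typedef enum {
  COMPAT_OK,          /* fully understood                              */
  COMPAT_CHECK_ENUMS, /* same major, newer minor: usable only if every
                         enum value encountered is known to us         */
  COMPAT_DELETE_ONLY  /* different major: only the deleter may be used */
} Compat;

Compat classify(DLPackVersion v) {
  if (v.major != DLPACK_MAJOR_VERSION) return COMPAT_DELETE_ONLY;
  if (v.minor > DLPACK_MINOR_VERSION) return COMPAT_CHECK_ENUMS;
  return COMPAT_OK;
}
```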

Member Author:

Thank you for the great suggestion, just added the comment per the suggestion.

Contributor:

Great suggestion +1

@tqchen (Member, Author) commented Jan 5, 2023

Thanks @wjakob for the suggestions, I have updated the comment to include your suggested clarifications. Let us move the discussion about the Python API to #116

@leofang (Contributor) left a comment:

LGTM except for one minor nit.

Commit notes (beyond the PR description above):

- Add kDLDeviceTypeEnd to help dependent packages check the range of possible extension values; the device end notation was later removed for now.
- Change the bitmask definition to a macro to avoid the enum int64 difference between C and C++.
@tqchen (Member, Author) commented Jan 6, 2023

@wjakob please let me know if you have more comments :)

@leofang leofang mentioned this pull request Jan 15, 2023
@tqchen (Member, Author) commented Jan 16, 2023

Going to merge if there is no more input in the next three days.

@tqchen tqchen merged commit ca4d00a into dmlc:main Jan 24, 2023
@tqchen (Member, Author) commented Jan 24, 2023

Thanks everyone for participating in the discussion and reviews; this change is now merged.
