
[ABI] ABI Update For adding Version to DLPack #113

Merged (4 commits into dmlc:main, Jan 24, 2023)
Conversation

@tqchen (Member) commented Nov 9, 2022

This PR adds DLManagedTensorVersioned to DLPack

  • Add a new data structure DLManagedTensorVersioned, which contains a new version field.
  • Add a DLPackVersion struct to indicate both API and ABI versions.
  • Add a flags field for boolean options, with support for a read-only bit mask.

The change remains ABI compatible, as the new ABI surface is confined to DLManagedTensorVersioned. The data API, however, would involve an ABI update, with all DLPack importers/exporters updated to make use of DLManagedTensorVersioned.
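For reference, the proposed layout can be sketched as follows. This is a minimal sketch, not the actual header: DLTensor is replaced by a stand-in, and the exact field names and version-field contents in dlpack.h may differ from what is shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the DLTensor defined in dlpack.h (hypothetical). */
typedef struct { void *data; int32_t ndim; } DemoDLTensor;

/* Carries both API and ABI version numbers, as proposed. */
typedef struct {
  uint32_t api;
  uint32_t abi;
} DLPackVersion;

/* The versioned managed tensor: version first, then the
 * manager/deleter pair, then flags, then the tensor itself. */
typedef struct DLManagedTensorVersioned {
  DLPackVersion version;
  void *manager_ctx;
  void (*deleter)(struct DLManagedTensorVersioned *self);
  uint64_t flags;
  DemoDLTensor dl_tensor;
} DLManagedTensorVersioned;

/* Because version is the first field, a consumer can always read it
 * before deciding whether it understands the rest of the struct. */
int version_is_first_field(void) {
  return offsetof(DLManagedTensorVersioned, version) == 0;
}
```

Placing the version field first is what makes the rest of the discussion below (what a consumer may touch when versions disagree) possible at all.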

@tqchen (Member, Author) commented Nov 9, 2022

notice thread #110

@tqchen (Member, Author) commented Nov 9, 2022

Ready for review. This PR will remain open for at least one week.

@seberg (Contributor) left a comment:

Thanks, just some small comments/questions (most you can also ignore).

I want to think a bit more on it, but the main thing I am still pondering is the (Python) exchange protocol, which would be an addition to the changes here.

```c
 * This is a helper flag to help dependent packages to check possible extensions.
 *
 * The dependent package should be static_assert(kExtendedFlag > kDLDeviceTypeEnd);
 * to ensure that version update do not override existing flags.
```
Contributor:

I don't follow. Do dependent packages sometimes store more things in the device?

Member Author:

Sometimes a dependent package may store extended device types that are not yet part of DLPack (which can be included later), so this is just providing convenience for such packages.

Contributor:

Hmmmm, but that seems tricky in a dynamically linked world? I cannot do static-asserts otherwise and would have to trust a version check.

But in that case, getting an older version, I would have to check that the device actually makes sense in the old version, even though I compile against the new version!

Can we say that "extended" versions MUST NOT be shared in a dynamically linked world like Python? Or alternatively, explicitly block off an "extended" area?

Reply:

> sometimes a dependent package may store extended device types that are not yet part of DLPack

Would it not be better to set kDLDeviceTypeEnd to a really large integer to ensure that there are guaranteed to be no clashes? Like a reserved range for vendor-specific extensions that haven't been standardized yet.

Contributor:

I would suggest sacrificing the top bit for this purpose and making that explicit.
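The top-bit idea could look something like the following. This is purely hypothetical: the names below are not part of dlpack.h, and as noted just after, the idea was ultimately dropped from this PR.

```c
#include <stdint.h>

/* Hypothetical: reserve the most significant bit of the device type
 * for vendor-specific extensions that are not (yet) standardized. */
#define DEMO_DL_EXT_DEVICE_BIT (1U << 31)

/* A consumer can then reject or special-case extension devices
 * without risking a clash with future standardized enum values. */
int is_extension_device(uint32_t device_type) {
  return (device_type & DEMO_DL_EXT_DEVICE_BIT) != 0;
}
```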

Member Author:

I realize that there are different opinions on this one, and it is not necessarily closely tied to this proposal update. As a result, I will remove it for now.


```diff
 /*! \brief The current ABI version of dlpack */
-#define DLPACK_ABI_VERSION 1
+#define DLPACK_ABI_VERSION 2
```
Contributor:

Technically, this doesn't even break the ABI, does it (since only new symbols were added)?

I am still thinking a bit about how we should go about this. I.e. we could make the main DLManagedTensorVersioned struct always be the newest one and when we update, add a DLManagedTensorV1 (or add it now).
The main point being: When a new version is introduced, we need to keep the old one around so NumPy, etc. can maintain compatibility with older libraries more easily.

Member Author:

That makes sense for a future update. On the other hand, keeping many versions in a single file can be prohibitive, so ideally we want to rotate among only two versions. We can do so after another ABI update.

This is indeed not an ABI-breaking change; on the other hand, it does indicate a change of ABI in the data API exchange (of what is contained in the PyCapsule), hence the update.

Contributor:

That is fine... You could also start backward compatibility by having headers like dlpack_v1.h. They won't matter too much if they pile up a bit, and we can decide how long we want to keep old stuff around.

```c
typedef enum {
#endif
  /*! \brief bit mask to indicate that the tensor is read only. */
  kDLFlagBitMaskReadOnly = 1UL,
```
Contributor:

I like the habit of writing 1UL << 0 for flags, but it doesn't matter for one flag :).

What would be interesting is to note whether we want any other flags already at this point!
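As a sketch of how such a flag is consumed: the macro form shown here matches the direction taken later in this thread (the bitmask was eventually changed from an enum to a macro), though the exact spelling in dlpack.h may differ.

```c
#include <stdint.h>

/* Read-only bit as a macro rather than an enum, sidestepping the
 * difference in how C and C++ treat 64-bit enum values. */
#define DLPACK_FLAG_BITMASK_READ_ONLY (1UL << 0UL)

/* A consumer checks the bit before handing out mutable views. */
int tensor_is_read_only(uint64_t flags) {
  return (flags & DLPACK_FLAG_BITMASK_READ_ONLY) != 0;
}
```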

Member Author:

Updated

```c
 *
 * \sa DLPackFlagBitMask
 */
uint64_t flags;
```
Contributor:

It would be good to note somewhere that we guarantee the fields remain ABI stable up to here (I guess including the flags). That is because consumers must be able to call the destructor even if they don't understand the version.

But after this, everything may be version dependent (it may make sense to specify whether growing the struct at the end is considered valid without bumping the ABI, in theory).
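A sketch of the consumer-side guarantee being discussed: everything up to and including the deleter stays fixed, so an unsupported tensor can always be released safely. The struct definitions below are minimal stand-ins (using the major/minor scheme adopted later in this thread), and MY_SUPPORTED_MAJOR is an illustrative constant, not a real dlpack.h symbol.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins; the real definitions live in dlpack.h. */
typedef struct { uint32_t major; uint32_t minor; } DLPackVersion;
typedef struct DLManagedTensorVersioned {
  DLPackVersion version;
  void *manager_ctx;
  void (*deleter)(struct DLManagedTensorVersioned *self);
  /* Fields after this point may differ between ABI versions. */
  uint64_t flags;
} DLManagedTensorVersioned;

#define MY_SUPPORTED_MAJOR 1 /* version this consumer was built for */

/* Returns 1 if the tensor can be used, 0 if it was released unused.
 * Calling the deleter is the only safe operation on a tensor whose
 * major version the consumer does not understand. */
int consume_or_release(DLManagedTensorVersioned *t) {
  if (t->version.major != MY_SUPPORTED_MAJOR) {
    if (t->deleter) t->deleter(t);
    return 0;
  }
  /* ... safe to interpret the remaining fields here ... */
  return 1;
}

/* Demo: a tensor from a "future" major version is released, not used. */
static int released;
static void demo_deleter(DLManagedTensorVersioned *self) {
  (void)self;
  released = 1;
}
int demo_future_tensor(void) {
  DLManagedTensorVersioned t = {
    {MY_SUPPORTED_MAJOR + 1, 0}, NULL, demo_deleter, 0
  };
  return consume_or_release(&t) == 0 && released == 1;
}
```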

Member Author:

I agree

@wjakob commented Nov 9, 2022:

Yes, this is the key part, and it would be useful to summarize the high-level idea in this header since it's the most important file of the entire repository.

If we implement DLPack version X and get a tensor of version Y (where Y>X), what are we allowed to do? What are we not allowed to do?

@wjakob commented Nov 9, 2022:

There are also two version numbers: the DLPack version and the ABI version. Is the first one completely irrelevant in deciding what is valid/disallowed, or does it also play a potential role? If yes, how is it different from the ABI version? If no, why does it need to be part of the ABI via a field stored in data structures?

Contributor:

It could be major.minor, where minor allows e.g. additional (compatible!) flags or additional devices/dtypes (not strictly compatible, but a mismatch clearly can't fail silently). Not sure what the current scheme is?

Member Author:

Added some clarification of the current scheme

```c
 * if there is no way for the caller to provide a reasonable destructor.
 * The destructors deletes the argument self as well.
/*!
 * \brief Destructor - this should be called
```
@wjakob commented Nov 9, 2022:

nit picking: formatting is a little messed up (double spaces, rows of different length).

Edit: That also applies to a few of the other newly added comments, but I will just mention it once here to avoid spamming the PR.

Also: I think the idea of the doxygen \brief tag is that the description is actually brief (1 sentence) followed by a newline to separate it from a more thorough comment. The \brief part is used as a preview.

Member Author:

Great point, fixed for this particular case. We can do further cleanups.

@wjakob commented Nov 13, 2022

I feel like the interface could still be more explicit about what I would argue is the most important point about a versioned ABI: what happens when a program is given a DLPack tensor by some other tool, and the DLPackVersion::abi value of this tensor is higher than what the current program was compiled with?

With the breakneck speed of advances happening in ML and array programming, it is not inconceivable that the DLManagedTensorVersioned data structure will need to be extended in the future, requiring another bump to the DLPACK_ABI_VERSION value. Here are just a few examples off the top of my head; I am sure we could brainstorm further and more convincing use cases if needed:

  • In future complex applications involving accelerators, the device ID may not be descriptive enough. For example, a single device might be associated with multiple execution contexts, which would then need to be specified. Actually, this is already possible with CUDA today, and such tensors cannot be safely exchanged because they live in separate address spaces.

  • As ML models become larger and more distributed, DLPack might need to be extended to refer to tensors on network-attached accelerators.

  • Experimentation with tensor data types (e.g. beyond BFloat16) might lead to a situation where more context is needed to explain the actual data type of a tensor (i.e., in a way that doesn't neatly fit into the existing enumeration).

If such extensions are added, they would likely be irrelevant for most users except for a small sub-community that requires them. And that's basically the point I want to raise: it would be a shame if simply adding an extra field required another ABI break.

The current PR contains the message

\note Future ABI changes should keep everything until this field stable, to ensure that deleter can be correctly called.

(and then the actual things of interest follow, specifically flags and dl_tensor)

I find this quite conservative. Reading between the lines, to me this statement suggests that any disagreement in the DLPackVersion::abi field will need to break interoperability between frameworks exchanging DLPack tensors. The only thing one is allowed to do at that point is give up and call the deleter.

In practice, it would be perfectly fine to add fields to the data structure without breaking interoperability with a lower-versioned consumer, especially when those extra fields encode nuanced attributes that are perhaps only relevant to very specific use cases.

ABI breaks are extremely disruptive to the community, especially in the Python ecosystem where everyone ships separately pre-compiled packages via PyPI. Even for this proposal, I suspect that it will take a long time for the community to adopt versioned DLTensors. Therefore, this is also a unique chance to lower the cost of future changes, and I feel like the current PR misses that opportunity.

@wjakob commented Nov 13, 2022

To turn the above complaint into something more constructive, let me propose a three-part (MAJOR.MINOR.PATCH) DLPackVersion versioning scheme with the following interpretation:

  1. A change in a DLPack tensor's PATCH level indicates minor changes to this header file on the producer's side, such as the addition of a new enumeration entry (a new device, a new tensor data type, etc.).

    Example: Suppose a new BFLOAT24 tensor type is added. It is legal for a DLPack consumer to read fields from a future PATCH level subject to the constraint that it can interpret the enumeration values it sees. If the consumer does not know how to interpret the BFLOAT24 type, it should give up and call the deleter.

  2. A bump in the MINOR field indicates that a field was added at the end of the DLPackVersioned data structure (in addition to potential enumeration additions).

    Example: Suppose that a new device ID is added (kDLNetworkedGPU) which refers to a network-capable GPU. Accessing data there requires an IP address and port field (lots of hand waving here, it's just an example.. :-)). As before, it is perfectly fine for a DLPack consumer to read fields from a tensor with a MINOR version from its relative future subject to the constraint that it can interpret the enumeration values it sees. In this example, it would be safe for the consumer to access the tensor as long as it doesn't have the the newly added kDLNetworkedGPU type. It would not access the extra field(s) that it is in any case not aware of.

  3. A bump in the MAJOR field indicates some major restructuring. It is not safe to use a tensor if a major version disagreement is detected.

    Example: the latest transformer model GPT-10 has now become so big that the DLTensor must be upgraded from int64_t *strides to int128_t *strides. That messes up the whole data layout, and a full ABI break is necessary. Even in this case, there is a guarantee, which is that the tensor data structure starts with the fields

```c
DLPackVersion version;
void *manager_ctx;
void (*deleter)(struct DLManagedTensorVersioned *self);
```

    This means that a consumer can at least call the deleter, but that's it.

@tqchen (Member, Author) commented Nov 13, 2022

Thanks @wjakob for the proposal. I agree that we want to be careful with the changes, and it is great to have these ideas discussed before we commit to them.

Would love to see what others think as well

@leofang (Contributor) commented Nov 26, 2022

This is an interesting discussion, but I would argue it is not of concern to the Python Data Consortium. Basically, all we need is a safe way to continue evolving the existing DLPack protocol for describing a span of tensor data residing on a single device (important to your second point below). IIRC there's no discussion on evolving the protocol itself, as so far everyone is mostly happy with the convenience brought by DLPack. There are plenty of prior discussions scattered across the open issues of this repo that are finally converging in this PR.

> In future complex applications involving accelerators, the device ID may not be descriptive enough. For example, a single device might be associated with multiple execution contexts, which would then need to be specified. Actually, this is already possible with CUDA today and such tensors cannot be safely exchanged because they live in separate address spaces.

The fact that such a possibility already exists but has never generated any change request to DLPack speaks for itself. In the case of multiple CUDA contexts co-residing on the same device, it is generally hard to exchange properly without some common understanding (ex: both sides must use the same CUDA context, which in practice almost always refers to the CUDA runtime context). It is even harder to take into account the API/ABI changes needed to accommodate similar situations in other vendors' programming models. If NV and Intel both consider this appropriate (I do not speak for either on this), I'd argue let's not overcomplicate the story 🙂 Flexibility is good, but less so if we cannot make any progress.

> As ML models become larger and more distributed, DLPack might need to be extended to refer to tensors on network-attached accelerators.

As mentioned above, DLPack has been designed as a single-device exchange protocol. This use case is exactly why array libraries like Dask and cuNumeric cannot support DLPack, CUDA Array Interface, or any similar protocol. Distributed protocols have been discussed extensively (ex: here), but the fact that it went nowhere is a big issue (no distributed library provider can find common ground) that should not block this PR.

> Experimentation with tensor data types (e.g. beyond BFloat16) might lead to a situation where more context is needed to explain the actual data type of a tensor (i.e., in a way that doesn't neatly fit into the existing enumeration).

An actual data type here is represented by DLDataType, which contains a type-code enum, bits, and lanes. If a tensor data type cannot be represented by these fields, it is arguably too complicated to exchange through DLPack. But maybe I am biased on this.
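To illustrate the point, the three DLDataType fields already cover cases like a bfloat16 scalar or a vectorized float32x4 without any protocol change. The definitions below mirror dlpack.h, but treat the exact enum numbering as an assumption of this sketch.

```c
#include <stdint.h>

/* Mirrors dlpack.h: a type-code enum plus bit width and lane count. */
typedef enum { kDLInt = 0U, kDLUInt = 1U, kDLFloat = 2U,
               kDLBfloat = 4U } DLDataTypeCode;
typedef struct { uint8_t code; uint8_t bits; uint16_t lanes; } DLDataType;

/* bfloat16: code = kDLBfloat, 16 bits, 1 lane. */
DLDataType make_bf16(void) {
  DLDataType t = { (uint8_t)kDLBfloat, 16, 1 };
  return t;
}

/* A vectorized float32x4 is expressible with the same three fields. */
DLDataType make_float32x4(void) {
  DLDataType t = { (uint8_t)kDLFloat, 32, 4 };
  return t;
}
```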

@seberg (Contributor) commented Nov 26, 2022

My gut feeling is that squashing "minor" and "patch" is fine (but I don't care either way).

There are two main groups of changes:

  • Breaking: Don't even try reading the data in this struct! (Beyond version and cleanup)
  • Other: May be incompatible, but you will notice as you go along (a new device can break you if it is unified memory and you are a CPU library), though it is clearly an "unsupported device". The version might be useful, but just to include it in an error to the user: "maybe a version problem?"

From the Python side I am not scared about either kind of change at all. But I ask for a version request (by the consumer). If they say version="1.3" and the exporter supports version="1.5", which added new devices (i.e. unified memory rather than just CPU), the exporter can decide not to use those new devices.
Note that it would probably be OK to export with "1.5" as the version for simplicity.

We don't have to discuss the details about what might happen, a true ABI break should be assumed to happen eventually. It doesn't matter much which scenario causes it :).

@leofang leofang mentioned this pull request Dec 4, 2022
@seberg (Contributor) commented Dec 12, 2022

Should this be pushed forward soon? It seems to me that the versioning may be the only small open discussion here.
Since further changes to the format should only affect the included tensor, there seems to be not much else?

I am just hoping it might unblock further steps. I am trying to think about the exchange protocol more, but unless we change API more there is probably not much to do. For Python, I would update the capsule name and pass a maximum supported version (so that the exporter can produce older versions if they wish to).

It isn't that I can't think of API breaks to consider, but it seems OK to push this in any case.

@tqchen (Member, Author) commented Dec 12, 2022

Yes, sorry, I was a bit slow near the end of the year.

Summarizing the suggestions on versioning here. How about we do the following:

```c
struct DLPackVersion {
    // A number that indicates an ABI break.
    int32_t major;
    // A number that indicates an API change that remains backward compatible.
    int32_t minor;
};
```

This would mark a change in the versioning scheme, which is probably fine given DLPack is pre-1.0. We can start with a 1.0 release with this new ABI change.

@leofang (Contributor) commented Dec 22, 2022

So the proposed DLPackVersion would make DLPack semantically versioned. I like it too.

@tqchen (Member, Author) commented Jan 4, 2023

Happy new year. It seems we are all good with the major/minor version change. @wjakob @leofang @seberg please take another look and approve explicitly if you can.

@tqchen tqchen force-pushed the main branch 2 times, most recently from 80669fa to e86f014 Compare January 4, 2023 15:41
@seberg (Contributor) left a comment:

Thanks, this seems like a good next step to me.

@wjakob left a comment:

I am sad to see minor and patch being squashed, but it seems I am in the minority. I have added a small request to include more information from the discussion here into the header file.

By the way, one off-topic point: the Python tensor.__dlpack__(version=..) interface will need some careful thinking to make sure it doesn't end up being a pain. For example, a version number in string form, as shown by @seberg above, would require everyone to build custom version number parsers.

What's worse is that there is no good way to test if a Python function actually accepts a version argument other than calling it to see if it raises an exception (which is super-slow especially when combined with C++ bindings that need to catch and re-throw exceptions in different languages). Existing libraries that don't support versioned tensors (and won't, for a while) will fall into this category.

So... I renew my request for a different API function (e.g. __vdlpack__(version: Optional[(int, int)]) for versioned DLPack). Anyway, probably a discussion for another PR, but I don't think this part will be as easy as it may seem.

```c
 * data layout of the ABI - DLManagedTensorVersioned.
 *
 * A change in minor version indicates that we have added new
 * code, such as a new device type, but the ABI is kept the same.
```
Reply:

I would add something of the following sort just to spell it out clearly somewhere:

If an obtained DLPack tensor has a major version that disagrees with the version number specified in this header file (i.e. major != DLPACK_MAJOR_VERSION), the consumer must call the deleter. It is not safe to access any other fields as the memory layout will have changed. In the case of a minor version mismatch, the tensor can be safely used as long as the consumer knows how to interpret all fields. Minor version updates indicate the addition of enumeration values.
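The rule above can be sketched as a consumer-side check. The DLPACK_MAJOR_VERSION / DLPACK_MINOR_VERSION values here are illustrative stand-ins for whatever the consumer's copy of dlpack.h defines.

```c
#include <stdint.h>

#define DLPACK_MAJOR_VERSION 1 /* illustrative: the version of the */
#define DLPACK_MINOR_VERSION 0 /* header this consumer built with  */

typedef struct { uint32_t major; uint32_t minor; } DLPackVersion;

typedef enum {
  COMPAT_OK,          /* fully understood                              */
  COMPAT_CHECK_ENUMS, /* same major, newer minor: usable only if every
                         enum value encountered is known to us         */
  COMPAT_DELETE_ONLY  /* different major: only the deleter may be used */
} Compat;

Compat classify(DLPackVersion v) {
  if (v.major != DLPACK_MAJOR_VERSION) return COMPAT_DELETE_ONLY;
  if (v.minor > DLPACK_MINOR_VERSION) return COMPAT_CHECK_ENUMS;
  return COMPAT_OK;
}
```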

Member Author:

Thank you for the great suggestion, just added the comment per the suggestion.

Contributor:

Great suggestion +1

@tqchen (Member, Author) commented Jan 5, 2023

Thanks @wjakob for the suggestions, I have updated the comment to include your suggested clarifications. Let us move the discussion about the Python API to #116

@leofang (Contributor) left a comment:

LGTM except for one minor nit.

Commit notes (beyond the PR description above):

- Add kDLDeviceTypeEnd to help dependent packages check the range of possible extension values; the device end notation was later removed for now.
- Change the bitmask definition to a macro to avoid the enum int64 difference between C and C++.
@tqchen (Member, Author) commented Jan 6, 2023

@wjakob please let me know if you have more comments :)

@leofang leofang mentioned this pull request Jan 15, 2023
@tqchen (Member, Author) commented Jan 16, 2023

Going to merge if there is no more input in the next three days.

@tqchen tqchen merged commit ca4d00a into dmlc:main Jan 24, 2023
@tqchen (Member, Author) commented Jan 24, 2023

Thanks everyone for participating in the discussion and reviews; this change is now merged.
