
[RFC] Improve TorchTune Extensibility and Build Interop with Ecosystem #442

Merged
merged 20 commits into main from b_interop
Mar 11, 2024

Conversation

kartikayk
Contributor

@kartikayk kartikayk commented Mar 4, 2024

Updates since last review:

Since the last review I've updated our checkpointing stack. The changes include:

  • Add FullModelTorchTuneCheckpointer, FullModelMetaCheckpointer and FullModelHFCheckpointer, which handle all of the logic associated with TorchTune, Meta and HF checkpoints. This drastically simplifies the recipe UX (look at full_finetune.py).
  • Fix the weirdness in our checkpointing story where we create state_dicts with model as a key even for final checkpoints. This is unintuitive and different from how other libraries and frameworks do this.
  • Since the change is only applied to full finetuning, I modify convert_checkpoint so that the LoRA recipe doesn't break. This is a temp hack and will go away once I make the change to LoRA.

PR Context

A couple of notes as you read through this RFC:

  • Please read through the context before looking at code. Yes, I know it's a long read.
  • The code is a WIP prototype (a.k.a. it kinda sucks), but it has been thoroughly tested locally. I still need to do some refactoring, add detailed unit tests and doc strings, and update docs and tutorials. Before that effort, I'd like some initial thoughts.

Context

Building interoperability with the surrounding ecosystem is critical for TorchTune. This "off-ramp", i.e. the ease with which users can use their fine-tuned checkpoints with their favorite tools, is as important as the fine-tuning capabilities provided by the library. It's not an exaggeration to say that without a strong interop story, it will be hard for TorchTune to gain traction within the community.

 

Understanding the Landscape

Before we go deeper into building interoperability with the ecosystem, let's take a quick look at the current ecosystem.

HF Model Hub is the de-facto source for most (if not all) popular LLMs. Each model is associated with a checkpoint format, which is distinct from inference-time formats like GGUF: the checkpoint format refers to the model state dict and how it's presented to model users. At a high level, checkpoint formats can be divided into two popular buckets:

  • Original. This refers to the original checkpoints from the LLM authors. For example, Meta provides a consolidated .pth file through the meta-llama repository on HF Model Hub. Various tools like llama.cpp build directly on top of this checkpoint by assuming the keys have a certain format.
  • HF Format. Popular models like Llama 7b are also available in the HF format through the HF Hub. Similar to the above, tools like llama.cpp make assumptions about the state dict format when exporting models for inference. These checkpoints are usually available as multiple .bin files (or stored as safetensors) with an associated index.json file which provides the information needed to build these state dicts back up.

Popular code bases like gpt-fast [script], GPTQ-for-Llama [script], llama.cpp [script] etc. all depend on the above formats or provide the option to write custom converters.

 

Given the above state, my claim is that we should build TorchTune to be "state-dict invariant", i.e. convert checkpoints from popular formats into TorchTune's format -> train -> convert back to the original format. The rest of this RFC goes over this idea.

 

But a few FAQs before that:

[Question] Why do we need to convert into a "TorchTune format"? Can't we just directly use the model classes from these popular frameworks like Transformers?

The TorchTune modeling components and classes are built with modularity and flexibility in mind. Using Transformers negates this design principle and takes away "extensibility" as one of our core value propositions to users. It also negates our goal of being "native PyTorch", since these frameworks and libraries have a strict structure which needs to be followed. gpt-fast has a similar setup, where the code base first [converts the checkpoint].

 

[Question] What about models like Mistral and Gemma which are neither from Meta nor from HF?

The above applies to both Mistral [HF Repo] and Gemma [HF Repo].

 

What does "be state-dict invariant" mean for TorchTune?

This has a sizable impact on our current user experience, but the ROI is high since not only do we get a "built-in off-ramp", but adding new models becomes easier.

Our current flow looks something like this:

  • Download model from HF Hub (we artificially constrain this to the meta-llama/Llama-2-7b repo)
  • Convert this into TorchTune format using our convert script
  • Train, and save intermediate checkpoints using a (very) custom format
  • Save final checkpoint in the same format

The above flow means that for inference, we need to first convert the final checkpoint into a standard format (e.g. GGUF or an ExecuTorch-friendly format) by writing a custom converter, which can be substantial work [example]. Alternatively, we need to adopt a standard implementation of popular models, which is also a no-go as mentioned above. As a result, adding new model support will be slow since we will need to build a new off-ramp for each model implementation.

The flow proposed by this RFC looks something like this:

  • Download model from HF Hub without major constraints. Eg: Any llama2 7B model which is in the original format or HF's format (eg: those from technium) should be usable OOTB
  • Convert the checkpoint into TorchTune's expected format
  • Train, and save intermediate checkpoints using our custom format
  • Convert final checkpoint BACK to the original format before writing out to file

To minimize cognitive load on the users, TorchTune recipes will handle conversions "to" and "from" the above formats instead of delegating this to the cli tool.

Concretely, the user would run the following:

# Download any model which is in a format supported by TorchTune; this will be a sizable list
tune download ... 

# Copy the config and recipe to custom folder; Modify config and/or recipe
tune cp ...

# Run fine-tuning without the need for any conversion
tune run ...

# Take model checkpoint and directly use with tool of choice
python3 llama.cpp/convert.py ...

With the above flow, I'm directly able to convert the model into GGUF to run inference and quantization using llama.cpp as well as use gpt-fast for running generation and quantization.

 

What changes in code?

The user no longer needs to know about TorchTune's checkpoint format unless they're resuming training from a previously failed run. Specifically, we:

  • Add support for META_FORMAT for llama-2-7b, which includes authoring the convert_llama2_from_meta_format and convert_llama2_to_meta_format functions for translating state dicts (a rough sketch follows after this list).
  • Expose CheckpointFormat through the recipe and config, and explicitly ask the user to specify the checkpoint format (detailed documentation will make this clear).
  • Update utils.load_checkpoint to extract the state_dict from the original checkpoint rather than the translated checkpoint. This means adding support for both checkpoint files and checkpoint directories (checkpoints can be split across multiple files), which is done through the _fetch_meta_format_state_dict function, which in turn is hooked up to utils.load_checkpoint through the load_external_checkpoint function.
  • Update load_checkpoint in the recipe to translate the state dict from the original format to TorchTune's format. Behavior for resuming training remains the same (see below).
  • Keep the same functionality for when resume_from_checkpoint is True. We need the additional information in the state dict.
  • Update save_checkpoint in the recipe to translate the final state dict back to the original format.
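To make the translation functions concrete, here is a rough sketch of the kind of key remapping involved. The mapping below is illustrative only; the actual convert_llama2_from_meta_format / convert_llama2_to_meta_format functions handle the full set of keys and edge cases.

```python
from typing import Dict

import torch

# Illustrative key mapping only; the real functions cover every layer and
# handle details like attention projections and rope parameters.
_FROM_META = {
    "tok_embeddings.weight": "tok_embeddings.weight",
    "norm.weight": "norm.scale",
    "output.weight": "output.weight",
}


def convert_from_meta_format(state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Translate a Meta-format state dict into TorchTune's expected key names."""
    return {_FROM_META.get(k, k): v for k, v in state_dict.items()}


def convert_to_meta_format(state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Invert the mapping so the final checkpoint matches the original format."""
    to_meta = {v: k for k, v in _FROM_META.items()}
    return {to_meta.get(k, k): v for k, v in state_dict.items()}
```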

 

How does this make things better?

We get two big advantages.

 

Adding support for a new format is straightforward

For adding the HF_FORMAT and opening up the repo to a large number of llama-7b models trained using transformers, we only need to:

  • Author convert_llama2_from_hf_format and convert_llama2_to_hf_format functions for translating state dicts.
  • Author _fetch_hf_format_state_dict, which contains the logic for extracting state dicts from multiple .bin files (a rough sketch follows below).
  • And that's it!

With the HF_FORMAT added, support for llama2 13B and 30B models (at least on an 80GB A100) should be straightforward as well. Adding Mistral-7B and Gemma-7B should not be as much work as before either.
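As a rough illustration of the second bullet above, merging a sharded HF checkpoint can be as simple as loading each .bin file and updating a single dict. This is a sketch under the assumption that shards hold disjoint subsets of keys, which is how HF splits full-model checkpoints.

```python
from pathlib import Path
from typing import Dict

import torch


def fetch_hf_format_state_dict(checkpoint_dir: Path) -> Dict[str, torch.Tensor]:
    """Merge pytorch_model-0000x-of-0000y.bin shards into a single state dict."""
    merged: Dict[str, torch.Tensor] = {}
    # Lexicographic order matches the shard order encoded in the file names.
    for ckpt_file in sorted(checkpoint_dir.glob("pytorch_model-*.bin")):
        shard = torch.load(ckpt_file, map_location="cpu")
        merged.update(shard)
    return merged
```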

 

Running your fav inference tools is straightforward

The following conversion to GGUF using llama.cpp works OOTB

python3 convert.py ~/cpts/finetuned_llama/ --ctx 4096

where finetuned_llama contains the output checkpoints from a full-finetuning run in TorchTune.

 

Other FAQs

How does this work for PEFT?

For the MVP, we plan to provide "merged weights" (inference-time weight merging and downstream support is out of scope). This should be a straightforward swap since the state dict keys remain the same.
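For reference, a minimal sketch of what "merged weights" means for a single LoRA linear layer; the alpha/rank scaling below is the standard LoRA convention, assumed rather than taken from this PR.

```python
import torch


def merge_lora_weights(
    base_weight: torch.Tensor,  # [out_dim, in_dim], frozen base weight
    lora_a: torch.Tensor,       # [rank, in_dim]
    lora_b: torch.Tensor,       # [out_dim, rank]
    alpha: float,
    rank: int,
) -> torch.Tensor:
    """Fold the low-rank update into the base weight so the merged state dict
    has the same keys and shapes as the original full-model checkpoint."""
    return base_weight + (alpha / rank) * (lora_b @ lora_a)
```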

 

Why not just keep the existing flow while adding a checkpoint conversion at the end?

Great question! A couple of reasons for this:

  • Asking the user to worry about checkpoint conversions and keeping track of different formats at different stages of the flow adds unnecessary cognitive load at best and is extremely annoying at worst. The user should specify this once in config and not have to worry about this anymore. This would still be worth considering if not for the second point below.
  • Our current flow is really hard to generalize to new models. Just adding support for a new format creates a 500-line (for now) PR.

 

The current code kinda sucks - I sort of understand what's going on, but there aren't any tests or doc strings. How do I know this actually works?

For now, take my word that it works! I'm working on updating the tests, adding detailed doc strings and even a tutorial on the overall flow. But before I put in all of that work, I'd like some initial feedback.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 4, 2024

@rohan-varma rohan-varma self-requested a review March 4, 2024 05:08
@rohan-varma
Member

High-level comment - totally agree with the proposal to support HF formats for popular models as well. Though I wanted to get into detail about this point:

With the HF_FORMAT added, support for llama2 13B and 30B models (at least on an 80GB A100) should be straight forward as well.

Curious how the HF format specifically enables easier support for llama2 13B and 30B? AFAIK, meta format checkpoints are published for these models as well, so we should be able to use those formats just as easily. Just wondering whether the HF format provides something additional for the larger scale models specifically.

@kartikayk
Contributor Author

@rohan-varma Sorry this was poorly worded. Adding the HF_FORMAT for 7B meant adding support for reading the state dict from multiple checkpoint files. Llama 13B is split across multiple .pth files (each ~13GB in size) which would have needed custom handling. The overall point was that with this change the repo becomes less specific to llama-7B which makes adding new models and weights easier.

@rohan-varma
Member

@rohan-varma Sorry this was poorly worded. Adding the HF_FORMAT for 7B meant adding support for reading the state dict from multiple checkpoint files. Llama 13B is split across multiple .pth files (each ~13GB in size) which would have needed custom handling. The overall point was that with this change the repo becomes less specific to llama-7B which makes adding new models and weights easier.

@kartikayk That makes sense. It seems like we should discuss how we should read and write multiple files? IIUC, the files are checkpoint shards, so what I've been doing is using distributed to stitch them together so we can load into torchtune models, i.e. in https://github.com/pytorch-labs/torchtune/pull/404/files. Curious about the approach that you took to load 13b?

@ebsmothers
Contributor

Thanks for creating this RFC! A few questions I have on the proposal:

  1. This makes perfect sense for Llama2-style models (and there are enough of them that this is a big lift), but what about other models that do not match this format (e.g. Mixtral)? Do we consider this a unique new format that we would support that then gets all the same treatment as Llama2 format in this proposal?

  2. Sounds like we are making an explicit contract with users here.. "give us a checkpoint in format X and we will output a fine-tuned checkpoint in format X". I know you mention PEFT, I am curious how this contract will hold up in adapters that are not linear like LoRA (e.g. bottleneck adapters come to mind). In such cases, do we break the contract, opt not to support them, or something else entirely?

  3. One thing I am still trying to reason about is our interaction with Hugging Face here. Currently we have an implicit dependency via our conversion scripts, but in a sense this is integrating this implicit dependency more tightly into our core offering (e.g. if something about their checkpoint format changes we will be completely broken until we can integrate it into our lib). I think it's a reasonable claim that the UX lift is enough to justify this, just want to call it out.

  4. Kind of related to both (1) and (3).. how do we envision maintenance of this flow down the line? Right now it is easy enough to track one set of state dict mappings, but want to make sure we do not just become a state dict remapping library. E.g. lit-gpt has their own format for Llama2, if we want to support checkpoints fine-tuned on there we need a separate utility. Prior to this change I think it's easier for us to tell users to just figure it out, but now I can imagine more of an expectation that we will support such things out of the box. And (I know I don't need to tell you) maintaining a ton of different state dict mappings can get quite messy.

  5. Just for my own understanding, what is our exact definition of "TorchTune format"? Is it state dict key equivalence (plus I guess the assumption that our model is TransformerDecoder class)? Are there any other constraints on parameter shapes? (Obv 7B, 13B, etc. all have their expected shapes, I am thinking more from the inference side of things here.)

  6. Similar question to @rohan-varma: how do we define our output format in the case where our final checkpoint has to be saved in a distributed fashion?

Sorry this wound up being more than a few questions .. 😅 .. let me know if any of these comments are unclear.

@kartikayk
Contributor Author

@rohan-varma why do I need distributed for this? Here's a simple code pointer from gpt-fast on how they handle the .bin files from HF (seems pretty standard), and I should be able to handle the .pth files similarly. Or do I misunderstand?
https://github.com/pytorch-labs/gpt-fast/blob/main/scripts/convert_hf_checkpoint.py#L67

@kartikayk
Contributor Author

@ebsmothers great questions! A few thoughts:

but what about other models that do not match this format

If you take a closer look, the format isn't model specific. It's "training framework" specific. Adding support for a model like Mistral isn't a new format. It's a new convert function which likely has a similar/same key mapping dict.

Sounds like we are making an explicit contract with users here.. "give us a checkpoint in format X and we will output a fine-tuned checkpoint in format X". I know you mention PEFT, I am curious how this contract will hold up in adapters that are not linear like LoRA

We'll need to figure this out once we have a concrete method that we can take a look at. But the answer is that we should align with the inference tooling on this. And that's the contract I propose here.

One thing I am still trying to reason about is our interaction with Hugging Face here

I view this as an interaction with the ecosystem, which happens to depend on HF formats for everything. You're right that if things break we'll have to go in and fix stuff, but the claim is that this is true generally. If tomorrow some other format becomes popular, we should just align with that. Does this make sense?

Kind of related to both (1) and (3).. how do we envision maintenance of this flow down the line? Right now it is easy enough to track one set of state dict mappings, but want to make sure we do not just become a state dict remapping library. E.g. lit-gpt has their own format for Llama2, if we want to support checkpoints fine-tuned on there we need a separate utility. Prior to this change I think it's easier for us to tell users to just figure it out, but now I can imagine more of an expectation that we will support such things out of the box. And (I know I don't need to tell you) maintaining a ton of different state dict mappings can get quite messy.

A few thoughts on this:

  • I don't think we need to explicitly support lit-gpt format or any other custom format. I think it's a pretty fair ask for users to take care of these conversions themselves. We have enough examples for how this can be done.
  • The maintenance story is going to be rough, but we'll need to figure this out even if we write custom adapters for every popular inference, eval and quantization tool out there.

Just for my own understanding, what is our exact definition of "TorchTune format"

Simply put - the format of a state dict that you can load directly into our model class, i.e. model.load_state_dict() works.
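A minimal illustration, assuming the llama2_7b builder path used by the test configs in this PR and a checkpoint path from the example config:

```python
import torch
from torchtune.models.llama2 import llama2_7b

model = llama2_7b()
sd = torch.load("/tmp/llama2/llama2_native.pt", map_location="cpu")
model.load_state_dict(sd)  # "TorchTune format" == this call succeeds
```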

how do we define our output format in the case where our final checkpoint has to be saved in a distributed fashion?

Responded a bit above. I don't think I understand what "distributed fashion" means here. See the code pointer above and the files in the HF repo for the 13B and 70B models. Maybe I'm missing something, so I need to understand this more.

@rohan-varma
Member

rohan-varma commented Mar 5, 2024

Responded a bit, I don't think I understand what "distributed fashion means here". See the code pointer above, the files in the HF repo for 13, 70B models

So IIUC, in those repos, what @ebsmothers means by the files being in a distributed fashion is that each model file is a sharded checkpoint of the entire model.

For example, the llama 2 70b checkpoint has 8 checkpoint files: https://huggingface.co/meta-llama/Llama-2-70b/tree/main. Each file contains the rank's owning shard from row / column parallel sharding. When converting to / from torchtune format, we'll need to appropriately unshard - which, to me, requires being aware of whether the particular tensor was column or row parallel sharded.
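For context, a rough sketch of what unsharding tensor-parallel shards involves; the dim-0 vs dim-1 split is the usual column/row-parallel convention and an assumption here, not something this PR implements:

```python
from typing import Dict, List

import torch

# Column-parallel weights (wq, wk, wv, w1, w3, output) are split along dim 0,
# row-parallel weights (wo, w2) along dim 1. Embeddings and replicated params
# (e.g. norms) need their own handling and are simply copied here.
_CAT_DIM = {"wq": 0, "wk": 0, "wv": 0, "w1": 0, "w3": 0, "output": 0, "wo": 1, "w2": 1}


def unshard(shards: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    merged = {}
    for key in shards[0]:
        name = key.split(".")[-2] if "." in key else key
        if name in _CAT_DIM:
            merged[key] = torch.cat([s[key] for s in shards], dim=_CAT_DIM[name])
        else:
            merged[key] = shards[0][key]  # replicated across ranks
    return merged
```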

@kartikayk
Contributor Author

Each file contains the rank's owning shard from row / column parallel sharding

I don't think this is necessarily true for the HF checkpoint, but I also need to do a lot more homework for 70B. Let's punt the discussion on 70B though. I think the complexity that 70B brings is orthogonal to this RFC. I'm also in the process of redesigning the entire checkpointing stack since it is sub-optimally written in its current form. This will allow us to decouple the checkpointing across model sizes (7B vs 70B) and across finetuning methods (full vs LoRA), which makes sense to me since a user fine-tuning a 7B or smaller model doesn't need to care about any of the complexity associated with the 70B model.

@ebsmothers ebsmothers left a comment (Contributor)

The FullModelCheckpointer generally makes sense to me, but I also think this is the easy case 😃. The main thing I'd push on here is more details around common PEFT partial save/load flows and making sure we properly support distributed checkpointing APIs (on that second point I would take a look at #443 if you haven't already)


# different formats for the same model have different dtypes. Track this
# to convert weights to the right format before writing to file
self._checkpoint_dtype = None
Contributor

Any use case for different weights in different dtypes?

Comment on lines 193 to 207
if intermediate_checkpoint:
# Add the relevant checkpoint format information to the state dict
checkpoint_dict["checkpoint_dtype"] = self._checkpoint_dtype
checkpoint_dict["weight_map"] = self._weight_map
checkpoint_dict["checkpoint_format"] = self._checkpoint_format.name

# We write to a single ".pt" file irrespective of the extension provided by the recipe
output_path = Path.joinpath(self._output_dir, intermediate_checkpoint_name).with_suffix(".pt")
torch.save(checkpoint_dict, output_path)
logger.info(
"Model checkpoint of size "
f"{os.path.getsize(output_path) / 1000**3:.2f} GB "
f"saved to {output_path}"
)
else:
Contributor

more of a nit, but if the whole method is an if/else maybe it should be split into e.g. save_intermediate_checkpoint and save_final_checkpoint methods?

Comment on lines 38 to 47
* Mid-training Checkpointing. In this case the state-dict contains more information
than just the model weights. It also contains the optimizer state dict and the training
state of the recipe needed to correctly restart training. The construction of the
state-dict, including figuring out what information is needed to correctly resume
training, is handled by the recipe. The checkpointer doesn't know or care about how the
state-dict is constructed. In this scenario, the checkpointer simply adds additional
information about the original checkpoint (eg: format, weight map etc) to ensure the
final checkpoint is constructed correctly in case the current training run fails and
needs to be resumed. Intermediate checkpoints don't require any conversion since these
are directly saved in the ``TORCHTUNE_FORMAT``.
Contributor

If the claim is that we are using this Checkpointer class to define and enforce a unified TorchTune format, the mid-training checkpointing contract here feels a little strange. It also introduces a lot of coupling between the recipe and the checkpointing class (for instance, the recipe is responsible for defining the logic to save an intermediate checkpoint, but the checkpointer then needs to know about this when loading the intermediate checkpoint). Lmk if I'm misunderstanding this though

Contributor Author

@ebsmothers actually the intent here is the opposite of what you mentioned. Right now the checkpointer and the recipe are really intertwined, with the checkpointing under utils having to know about recipe-specific keys and how the checkpoint is formatted. With this change, the separation of concerns is a lot cleaner, i.e.:

  • The recipe handles all of the logic for preparing intermediate checkpoints (it knows what state it needs to resume correctly).
  • The checkpointer handles loading the external checkpoint and saving the final checkpoint.

There will always be some coupling between the two because the checkpointer is meant to be used in the recipe, but checkpointers should generally be recipe agnostic. Let me know if this makes sense.
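The separation of concerns described above roughly corresponds to an interface like this (a sketch based on the method names and signatures visible elsewhere in this PR):

```python
from typing import Any, Dict, Protocol


class CheckpointerInterface(Protocol):
    """What a recipe needs from a checkpointer; nothing recipe-specific leaks in."""

    def load_checkpoint(self) -> Dict[str, Any]:
        """Read the external checkpoint and return a state dict the recipe can consume."""
        ...

    def save_checkpoint(
        self, state_dict: Dict[str, Any], intermediate_checkpoint: bool = False
    ) -> None:
        """Write the final checkpoint in the original format; for intermediate
        checkpoints, also persist whatever state the recipe packed in for resuming."""
        ...
```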

torchtune/utils/_checkpointing/_checkpointer.py (outdated comment, resolved)
Comment on lines 359 to 361
for key in state_dict.keys():
self._weight_map[key] = ckpt_file.name
merged_state_dict.update(state_dict)
Contributor

What if params are sharded across files? I.e. each individual file contains a subset of the weights for a single state dict key

Contributor Author

This should be a new converter. I don't have a case for this right now until we add support for 13B or 70B. But I expect that to go in a different checkpointer class (or maybe we can update this one if it makes sense).

@kartikayk
Contributor Author

@ebsmothers thank you for the review

The main thing I'd push on here is more details around common PEFT partial save/load flows and making sure we properly support distributed checkpointing APIs

So this PR won't address that complexity. My main point is that having a separate training component will make things easier to separate, i.e. not have FFT users worry about PEFT checkpointing. Is there anything in the design specifically that stops us from doing this? In the worst case I can map utils.save_checkpoint to the save_checkpoint in this class and things will work just fine?

@kimishpatel

One question I have about:

# Take model checkpoint and directly use with tool of choice
python3 llama.cpp/convert.py ...

Given "Convert final checkpoint BACK to the original format before writing out to file", the existence of llama.cpp/convert.py is just a convenience util, right? I can take the "checkpoint that is converted back to original format" to the llama.cpp repo as well, right? To me the latter feels more composable compared to writing a thin wrapper. Thoughts?

@kartikayk
Contributor Author

@kimishpatel Sorry I should clarify this point better.

convert.py is a file in the llama.cpp repo which is used to convert a given checkpoint to GGUF format for use with their inference stack. The point I'm making is that since we convert the checkpoint back, you can directly use the standard convert script in their repo instead of having to write a new one. Does this make sense?

README.md Outdated
```

The argument passed to `--nproc_per_node` can be varied depending on how many GPUs you have. A full finetune can be memory-intensive, so make sure you are running on enough devices. See [this table](https://github.com/pytorch-labs/torchtune/blob/main/README.md#finetuning-resource-requirements) for resource requirements on common hardware setups.

Similarly, you can finetune with LoRA on the Alpaca dataset on two devices via
Similarly, you can finetune with LoRA on the Alpaca dataset on two devices via the following. Remember to convert your
model with ```train_type``` set to ```lora'```
Contributor

Suggested change
model with ```train_type``` set to ```lora'```
model with `train_type` set to `lora`

Comment on lines +99 to +107
ckpts = (
["llama2.llama2_7b"]
if large_scale
else [
"small_test_ckpt_tune",
"small_test_ckpt_hf",
"small_test_ckpt_meta",
]
)
Contributor

This should just be a @pytest.mark.parametrize, no?

Contributor Author

I thought about that and I don't think that'll work when we pass in --large-scale? I actually don't know how that works, so went with a manual for loop. Let me know if you think that will work.

Contributor

Hmmm good point. I don't immediately see how to do this. Ideally we should do this more cleanly, but no need to block this PR on it

@@ -164,15 +228,21 @@ def test_gradient_accumulation(
# We use a tiny model to reduce the error accumulation in the test
# It's impossible to make a large model produce the same loss values
# in the same way as the full batch size.
model_ckpt = "llama2_tiny_test_ckpt"
model_ckpt = "small_test_ckpt_tune"
Contributor

Oh this works? I thought we needed the tiny checkpoint for the accumulation of errors to be small enough

Contributor Author

Hmm, it works for me :)

cmd = f"""
tune full_finetune \
--config {_CONFIG_PATH} \
--override \
model._component_=torchtune.models.{model_ckpt} \
model_checkpoint={fetch_ckpt_model_path(model_ckpt)} \
checkpointer._component_=torchtune.utils.FullModelTorchTuneCheckpointer \
Contributor

nit: you could just write a single base command, then append micro_batch_size cfg for the first invocation and gradient_accumulation_steps cfg for the second invocation, just to make it clearer that's all that's changing. But obv it was in this state before you got here so not a huge deal if you don't do this

Contributor Author

I'll follow up with changes beyond this PR in a separate PR. It's complex as it is.

Comment on lines +314 to +315
assert len(output_state_dict_1.keys()) + 1 == len(orig_state_dict_1.keys())
assert len(output_state_dict_2.keys()) + 1 == len(orig_state_dict_2.keys())
Contributor

inv freq accounts for the +1 here as well?

Contributor Author
@kartikayk kartikayk Mar 11, 2024

Yes, we have inv_freq in each layer. Since each dict corresponds to a single layer, both of them are off by 1.


class FullModelTorchTuneCheckpointer(_CheckpointerInterface):
"""
Checkpointer which reads and writes "full-model" checkpoints in a format compatible with
Contributor

nit: I think the definition of "full-model" is not sufficiently clear from the context here

self._checkpoint_path = Path.joinpath(self._checkpoint_dir, checkpoint_files[0])
if (
not self._checkpoint_path.is_file()
or not self._checkpoint_path.suffix == ".pt"
Contributor

Just curious: is the .pt extension a hard requirement? Clearly we've had non-pt extensions floating around for a while without really breaking anything

Contributor Author

It's very unintuitive, I'd just make this a convention for checkpoints we write.

# if resume_from_checkpoint is True, recipe_state.pt should contain the recipe state
if self._resume_from_checkpoint:
self._recipe_state_file = get_recipe_checkpoint_path(
self._checkpoint_dir, filename="recipe_state.pt"
Contributor

Would consider defining recipe_state.pt as a constant somewhere (maybe class-level) for increased visibility

Contributor

Also I know you mentioned saving as a JSON previously, curious if there's any particular reason for switching to .pt?

Contributor Author

Good point about the constants. Let me generalize this a bit when I do the LoRA PR.

Optim state can be quite heavy (as large as the model checkpoint), so saving it as JSON was a bad idea.

Comment on lines +175 to +181
Model:
{
"key_1": weight
...
}

Recipe State:
Contributor

Would maybe explicitly add a comment about the filename each of these is saved to (I know you already mention recipe_state.pt above). At first glance I kinda thought this was all a single state dict; adding filenames explicitly may make it harder to make that mistake.

self._resume_from_checkpoint = resume_from_checkpoint

# weight_map contains the state_dict key -> checkpoint file mapping so we can correctly
# parition the state dict into output checkpoint files. This is updated during checkpoint
Contributor

nit

Suggested change
# parition the state dict into output checkpoint files. This is updated during checkpoint
# partition the state dict into output checkpoint files. This is updated during checkpoint

f"Found {type(value)} instead."
)
# idx is written in the 4 digit format (eg: 0001, 0002, etc.)
self._weight_map[key] = f"{cpt_idx+1:04}"
Contributor

Probably a dumb q: why don't we just use the filename directly as the key here rather than going through all the sorting and indexing business?

Contributor Author

So the file names here are of the form pytorch_model-00001-of-00003.bin, pytorch_model-00002-of-00003.bin and so on. The ID (00001) is important since the weights are written in this order. But the entire filename doesn't serve any purpose (for now). So I simplified all of the filename logic and just added the ID by making sure the incoming names are lexicographically sorted.
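A tiny sketch of that bookkeeping (the shard file pattern is the usual HF one; the exact implementation in the PR may differ):

```python
from pathlib import Path
from typing import Dict

import torch


def build_weight_map(checkpoint_dir: Path) -> Dict[str, str]:
    """Map each state-dict key to the 4-digit ID of the shard it was read from."""
    weight_map: Dict[str, str] = {}
    # Lexicographic sort of pytorch_model-0000x-of-0000y.bin preserves shard order.
    for cpt_idx, ckpt_file in enumerate(sorted(checkpoint_dir.glob("pytorch_model-*.bin"))):
        shard = torch.load(ckpt_file, map_location="cpu")
        for key in shard:
            weight_map[key] = f"{cpt_idx + 1:04}"  # "0001", "0002", ...
    return weight_map
```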

Comment on lines +344 to +346
num_heads=self._config["num_attention_heads"],
num_kv_heads=self._config["num_key_value_heads"],
dim=self._config["hidden_size"],
Contributor

Do these keys always exist?

Contributor Author

For llama models they always do. I checked a few others and they do as well. But it might be a good idea to add a try/except block here.

Comment on lines +413 to +416
# If the recipe state needs to be output, first remove the model state dict
if intermediate_checkpoint:
_ = state_dict.pop("model")
torch.save(state_dict, Path.joinpath(self._output_dir, "recipe_state.pt"))
Contributor

Just to check my understanding: when saving intermediate checkpoints, we still save in the input format (as opposed to TorchTune format, which we were doing previously)? And then we just supplement with the additional recipe_state.pt file?

Contributor Author

Yeah, that's right. This way if I want to run inference or eval on intermediate checkpoints, I don't need to do any conversion. It also simplifies save significantly, which is nice.
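A hedged sketch of that save flow (the conversion helper and output file name are placeholders, not the PR's actual names):

```python
from pathlib import Path
from typing import Any, Dict

import torch


def convert_to_original_format(model_sd: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder for the real key remapping, e.g. convert_llama2_to_meta_format.
    return model_sd


def save_checkpoint(state_dict: Dict[str, Any], output_dir: Path, intermediate_checkpoint: bool) -> None:
    # Model weights always go out in the original (input) format, so intermediate
    # checkpoints can be used for inference/eval without any extra conversion.
    model_sd = convert_to_original_format(state_dict.pop("model"))
    torch.save(model_sd, output_dir / "model_checkpoint.pt")

    # Everything else (optimizer state, epoch, etc.) only matters for resuming
    # and is written to a separate recipe_state.pt for mid-training checkpoints.
    if intermediate_checkpoint:
        torch.save(state_dict, output_dir / "recipe_state.pt")
```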

Comment on lines +421 to +422
Checkpointer which reads and writes "full-model" checkpoints in Meta's format. Example includes
the Llama-2-7b model from the meta-llama repo (https://huggingface.co/meta-llama/Llama-2-7b)
Contributor

Same link as the HF checkpointer?

Member

I think it's different?

Contributor

Missed those last two letters..


def load_checkpoint(self) -> Dict[str, Any]:
"""
Load TorchTune checkpoint from file. Currently only loading from a single file is supported.
Contributor

Similar comment here: is this line copy-pasted from the equivalent in FullModelTorchTuneCheckpointer? Just generally, make sure to update docstrings for all of these methods.

}
"""
state_dict: Dict[str:Any] = {}
state_dict["model"] = safe_torch_load(self._checkpoint_path)
Contributor

I thought we talked about getting rid of the "model" key? I might be missing something/misremembering though, lmk if so

Contributor Author

I get rid of "model" for checkpoint save. For the state dicts being sent to the recipe we still need this, so we have a single dict with both model and recipe state.

@ebsmothers ebsmothers left a comment (Contributor)

Left a bunch more comments, but I think all the major design concerns from my side are addressed in this latest version. So modulo my open comments, this looks good to me.

@rohan-varma rohan-varma left a comment (Member)

This is a legendary PR, as usual! Couple of points I want to make sure we discuss prior to landing (also mostly okay with discussing these post land if absolutely needed, save for the question around security and extending to sharded checkpoints):

  1. User having to specify the train-type when converting their ckpts into our training format seems a bit unintuitive to me. Could you clarify why we have to do this for the moment, what's blocking us from not having to specify this, and how we can get there?
  2. Seems like we would need to write a LoRA specific checkpointer for this to work with LoRA. Is the reason the same as (1), and can we have a discussion here?
  3. Curious about the del state_dict and gc collect - were you seeing memory issues here?
  4. Not directly related to this PR, but I'm realizing having tune download + operating on the checkpoints could introduce a security risk if we allow users to specify arbitrary checkpoint paths - we're downloading somewhat arbitrary checkpoint files from a third-party and running python code on them. This could also be on sensitive HW such as company devices. See https://www.darkreading.com/application-security/hugging-face-ai-platform-100-malicious-code-execution-models for potential attacks. Any thoughts on this security risk and how we can mitigate? One super easy win seems to just be to have an allowlist of checkpoints we've vetted, and crash tune download on any other checkpoints by default?
  5. Extensibility to sharded checkpoints. Details on this point are in the CR comments.

@@ -56,6 +56,9 @@ jobs:
mkdir -p /tmp/test-artifacts
aws s3 cp s3:https://pytorch-multimodal/llama2-7b/tokenizer.model /tmp/test-artifacts
aws s3 cp s3:https://pytorch-multimodal/llama2-7b-01242024 /tmp/test-artifacts
aws s3 cp s3:https://pytorch-multimodal/small-ckpt-hf-03082024.pt /tmp/test-artifacts
aws s3 cp s3:https://pytorch-multimodal/small-ckpt-tune-03082024.pt /tmp/test-artifacts
aws s3 cp s3:https://pytorch-multimodal/small-ckpt-meta-03082024.pt /tmp/test-artifacts
Member

Hardcoding this tempfile directory is going to get pretty annoying, as users on the same box can overwrite each other's stuff / not have access to this directory. We should at least add some sort of unique id to this.

parser.add_argument(
"--train-type",
type=str,
help="Type of finetuning. Currently Full-Finetuning and LoRA have slightly different formats. "
Member

Hmm, I'd like to discuss and understand this more deeply. The user having to specify the train type for checkpoint conversion is quite unintuitive, and ideally checkpoint conversion shouldn't have to know whether the model is going to be used for a full or LoRA finetune at all - it should just produce a checkpoint format that any torchtune training recipe (at least the ones that we write and endorse) can consume.

It also presents additional overhead to the user - I have to run separate conversion scripts to use different finetune techniques. I know currently it's a one-time overhead, but it's worth pointing out, and the overhead will scale as we introduce more models.

I'm not able to tell from the PR description what the blocker is in just enabling a consistent format for both finetune techniques - mind elaborating? Thanks!

Contributor Author

This is a temp hack till I fix LoRA. This PR just makes the change for full finetune. I'll follow up shortly with a PR that fixes this for LoRA and will remove this. I'll add this to the PR description

@@ -33,6 +34,7 @@ def convert_checkpoint(
checkpoint_path (Path): Path to the checkpoint path.
model (str): Model name
output_path (Optional[Path]): Path to the output checkpoint.
train_type (str): Type of finetuning
Member

Add "must be full or lora"?

checkpointer:
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
checkpoint_dir: /tmp/llama2/
checkpoint_files: [llama2_native.pt]
Member

Is this the file we will write out? what's the plan for scaling this when we need to produce checkpoint shards (at least for intermediate checkpoints)?

Contributor Author

This is the file we read from. The output file will be written to output_dir and matches the input format. See the checkpointer doc string

_component_: torchtune.utils.FullModelTorchTuneCheckpointer
checkpoint_dir: /tmp/llama2/
checkpoint_files: [llama2_native.pt]
model_type: LLAMA2
Member

Where can I, as a user, find the supported model_type values?

Contributor Author

I'll add more info on this after the LoRA change. For now it's a copy-paste from the config.


def load_checkpoint(self) -> Dict[str, Any]:
"""
Load TorchTune checkpoint from file. Currently only loading from a single file is supported.
Member

Do we mean load a meta checkpoint?

intermediate_checkpoint: bool = False,
) -> None:
"""
Save TorchTune checkpoint to file. If ``intermediate_checkpoint`` is True, an additional
Member

"Save torchtune checkpoint to file, in meta format"?

torch.save(state_dict, Path.joinpath(self._output_dir, "recipe_state.pt"))


class FullModelMetaCheckpointer(_CheckpointerInterface):
Member

This is great, but I'm curious about the utility of the save portion of this file. Are meta checkpoints really used for any off-ramps at the moment? They might be and I'm just unaware - do you have a list of off-ramps that directly consume the meta-style checkpoints?

Contributor Author

llama.cpp, for example, was built with the Meta format in mind and then extended to handle HF checkpoints. Generally it's a bad idea to not support the original model format. For future llama versions we'll just launch with this format.

state_dict.update(recipe_state)
return state_dict

def save_checkpoint(
Member

Another comment about saving in meta format. Do we anticipate use cases where a user wants to load a meta checkpoint --> train in torchtune --> output an HF checkpoint? This seems decently natural if the HF checkpoint format supports a lot of off-ramps. Should we support this, or is there a way the user can work around it with this work?

Contributor Author

I definitely don't want to include the cross-conversion complexity here. We can think about providing some conversion scripts for this scenario.

TORCHTUNE_RESTART = "torchtune_restart"


class ModelType(Enum):
Member

I don't fully follow why LoRA should have a different checkpoint, any details we can get here?

@kartikayk
Contributor Author

Thanks for the detailed reviews @ebsmothers and @rohan-varma.

User having to specify the train-type when converting their ckpts into our training format seems a bit unintuitive to me.

This is just a hack since I don't have LoRA addressed in this PR. I'll work on that later today and remove this.

Seems like we would need to write a LoRA specific checkpointer for this to work with LoRA. Is the reason the same as (1), and can we have a discussion here?

Not for the MVP. We'll assume we always merge the weights and write a full checkpoint. Once we add adapter checkpointing we'll need to alter save and load, and I'd rather have that as a separate class than just add complexity to the full checkpointer.

Curious about the del state_dict and gc collect - were you seeing memory issues here?

This needs a bit more testing before I can remove this. Seems like no harm to leave it there for now?

Not directly related to this PR, but I'm realizing having tune download + operating on the checkpoints could introduce a security risk if we allow users to specify arbitrary checkpoint paths - we're downloading somewhat arbitrary checkpoint files from a third-party and running python code on them. This could also be on sensitive HW such as company devices. See https://www.darkreading.com/application-security/hugging-face-ai-platform-100-malicious-code-execution-models for potential attacks. Any thoughts on this security risk and how we can mitigate? One super easy win seems to just be to have an allowlist of checkpoints we've vetted, and crash tune download on any other checkpoints by default?

It's unclear. We can restrict it, but it's an arbitrary restriction and is annoying. Currently we restrict this and I just go uncomment it. I'm not sure what the right way to handle this is other than just letting the Hub handle it. We can also support safetensors, but I don't want to take the transformers dependency.

Extensiblity to sharded checkpoints. Details on this point are in the CR comments.

Responded

@kartikayk kartikayk merged commit 4c1fa52 into main Mar 11, 2024
17 checks passed
@kartikayk kartikayk deleted the b_interop branch March 11, 2024 20:00