Refactor everything outside of core to be out of the main megatron namespace.
jaredcasper committed Mar 26, 2024
1 parent dc7fa88 commit 38644dd
Showing 159 changed files with 478 additions and 605 deletions.
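
The renames in this commit are mechanical and follow three patterns, summarized below (a sketch compiled from the hunks that follow; only modules that actually appear in the diff are listed):

```python
# Training-loop utilities move from the top-level megatron package
# into megatron.training:
from megatron.training import get_args, get_timers, get_tokenizer, print_rank_0
from megatron.training.arguments import core_transformer_config_from_args
from megatron.training.checkpointing import load_checkpoint, save_checkpoint
from megatron.training.initialize import initialize_megatron
from megatron.training.utils import get_ltor_masks_and_position_ids, unwrap_model

# Pre-mcore model and data code moves under megatron.legacy:
from megatron.legacy.model import GPTModel
from megatron.legacy.data.data_samplers import MegatronPretrainingSampler

# Text-generation and deployment code moves under megatron.inference
# (and examples/deploy is renamed to examples/inference):
from megatron.inference.text_generation import generate_and_post_process

# megatron.core itself is untouched, e.g.:
from megatron.core import mpu
```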
12 changes: 6 additions & 6 deletions README.md
@@ -157,7 +157,7 @@ The [`examples/pretrain_bert.sh`](./examples/pretrain_bert.sh) script runs singl

The logging, checkpoint-saving, and evaluation interval options are specified. Note that the `--data-path` now includes the additional `_text_sentence` suffix added in preprocessing, but does not include the file extensions.

Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
Further command line arguments are described in the source file [`arguments.py`](./megatron/training/arguments.py).

To run `examples/pretrain_bert.sh`, make any desired modifications including setting the environment variables for `CHECKPOINT_PATH`, `VOCAB_FILE`, and `DATA_PATH`. Make sure to set these variables to their paths in the container. Then launch the container with Megatron and necessary paths mounted (as explained in [Setup](#setup)) and run the example script.

@@ -167,7 +167,7 @@ The `examples/pretrain_gpt.sh` script runs single GPU 345M parameter GPT pretrai

It follows largely the same format as the previous BERT script with a few notable differences: the tokenization scheme used is BPE (which requires a merge table and a `json` vocabulary file) instead of WordPiece, the model architecture allows for longer sequences (note that the max position embedding must be greater than or equal to the maximum sequence length), and the `--lr-decay-style` has been set to cosine decay. Note that the `--data-path` now includes the additional `_text_document` suffix added in preprocessing, but does not include the file extensions.
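
The length constraint mentioned above is simple to check up front. A minimal sketch (function and values are illustrative, not Megatron's actual validation code):

```python
def check_sequence_config(max_position_embeddings: int, seq_length: int) -> None:
    """Sketch of the constraint noted above: position embeddings must be
    able to index every position in the longest training sequence."""
    assert max_position_embeddings >= seq_length, (
        "max position embedding must be >= the maximum sequence length"
    )

# Illustrative values in the style of the 345M GPT example script:
check_sequence_config(max_position_embeddings=1024, seq_length=1024)
```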

Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
Further command line arguments are described in the source file [`arguments.py`](./megatron/training/arguments.py).

`examples/pretrain_gpt.sh` can be launched the same way as described for BERT. Set the env vars and make any other modifications, launch the container with appropriate mounts, and run the script.

@@ -290,7 +290,7 @@ python preprocess_data.py \
--workers 5 # works well for 10 CPU cores. Scale up accordingly.
</pre>
2. Use a custom samples mapping function in place of `megatron/data/realm_dataset_utils.get_block_samples_mapping` if required. To do this, you will need to implement a new function in C++ inside of `megatron/data/helpers.cpp`. The samples mapping data structure is used to select the data that will constitute every training sample in advance of the training loop.
2. Use a custom samples mapping function in place of `megatron/legacy/data/realm_dataset_utils.get_block_samples_mapping` if required. To do this, you will need to implement a new function in C++ inside of `megatron/core/datasets/helpers.cpp`. The samples mapping data structure is used to select the data that will constitute every training sample in advance of the training loop.
The samples mapping holds all of the metadata required to construct each sample from one or more indexed datasets. In REALM, it contains the start and end sentence indices, the document index (to find the correct title for a body), and a unique ID for every block; see the sketch after this list.
3. Pretrain a BERT language model using `pretrain_bert.py`, with the sequence length equal to the block size in token ids. This model should be trained on the same indexed dataset that is used to supply the blocks for the information retrieval task.
In REALM, this is an uncased bert base model trained with the standard hyperparameters.
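
The per-block metadata described in step 2 can be pictured as a flat record (a sketch only; the field names are illustrative, and the real mapping is a packed numpy array built in C++ for speed):

```python
from dataclasses import dataclass

@dataclass
class BlockSampleRecord:
    """Illustrative shape of one entry in a REALM samples mapping
    (hypothetical field names, not the actual dtype from helpers.cpp)."""
    start_idx: int  # index of the first sentence in the block
    end_idx: int    # index one past the last sentence in the block
    doc_idx: int    # document index, used to look up the correct title
    block_idx: int  # unique ID for the block

# The mapping is precomputed before the training loop, so building a
# training sample later is a cheap lookup into these records.
sample = BlockSampleRecord(start_idx=10, end_idx=14, doc_idx=3, block_idx=42)
```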
@@ -384,7 +384,7 @@ You can also use CURL or any other tools to query the server directly:
curl 'http://localhost:5000/api' -X 'PUT' -H 'Content-Type: application/json; charset=UTF-8' -d '{"prompts":["Hello world"], "tokens_to_generate":1}'
</pre>

See [megatron/text_generation_server.py](megatron/text_generation_server.py) for more API options.
See [megatron/inference/text_generation_server.py](megatron/inference/text_generation_server.py) for more API options.
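
The same request can be issued from Python (a sketch using the third-party `requests` library, assuming the server above is listening on `localhost:5000`):

```python
import requests

# Mirrors the curl example above: PUT a JSON body with a list of prompts
# and a token budget to the server's /api endpoint.
response = requests.put(
    "http://localhost:5000/api",
    json={"prompts": ["Hello world"], "tokens_to_generate": 1},
    headers={"Content-Type": "application/json; charset=UTF-8"},
)
print(response.json())
```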

### Detoxify GPT via Self-generation
We include an example in `examples/detxoify_lm/` to detoxify language models by leveraging the generative power of language models.
@@ -531,10 +531,10 @@ The Llama-2 [family of models](https://ai.meta.com/llama/) are an open-source se
The Llama-2 checkpoints can be loaded into Megatron for inference and finetuning. See documentation [here](docs/llama2.md).

# Model Optimization and Deployment
Megatron-Core (MCore) `GPTModel` family supports advanced quantization algorithms and high-performance deployment through TensorRT-LLM.
Megatron-Core (MCore) `GPTModel` family supports advanced quantization algorithms and high-performance inference through TensorRT-LLM.

## Quantization and TensorRT-LLM Deployment
See [Megatron Model Optimization and Deployment](examples/deploy/README.md) for `llama2` and `nemotron3` examples.
See [Megatron Model Optimization and Deployment](examples/inference/README.md) for `llama2` and `nemotron3` examples.

# Datasets
We do not host any datasets for GPT or BERT training, however, we detail their collection so that our results may be reproduced.
14 changes: 7 additions & 7 deletions examples/detxoify_lm/finetune_gpt.py
@@ -10,19 +10,19 @@
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
os.path.pardir, os.path.pardir)))
from megatron import get_args
from megatron import get_timers
from megatron import get_tokenizer
from megatron import print_rank_0
from megatron.training import get_args
from megatron.training import get_timers
from megatron.training import get_tokenizer
from megatron.training import print_rank_0
from megatron.core import mpu
from megatron.core.datasets.blended_megatron_dataset_builder import BlendedMegatronDatasetBuilder
from megatron.core.datasets.blended_megatron_dataset_config import GPTDatasetConfig
from megatron.core.datasets.gpt_dataset import GPTDataset
from megatron.model import GPTModel
from megatron.legacy.model import GPTModel
from megatron.core.enums import ModelType
from megatron.training import pretrain
from megatron.utils import get_ltor_masks_and_position_ids
from megatron.utils import average_losses_across_data_parallel_group
from megatron.training.utils import get_ltor_masks_and_position_ids
from megatron.training.utils import average_losses_across_data_parallel_group

def model_provider(pre_process=True, post_process=True):
"""Build the model."""
26 changes: 13 additions & 13 deletions examples/detxoify_lm/generate_samples_gpt.py
@@ -9,24 +9,24 @@
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
os.path.pardir, os.path.pardir)))
import torch
from megatron import get_args
from megatron import get_tokenizer
from megatron import print_rank_0
from megatron.checkpointing import load_checkpoint
from megatron.training import get_args
from megatron.training import get_tokenizer
from megatron.training import print_rank_0
from megatron.training.checkpointing import load_checkpoint
from megatron.core import mpu
from megatron.initialize import initialize_megatron
from megatron.model import GPTModel
from megatron.training.initialize import initialize_megatron
from megatron.legacy.model import GPTModel
from megatron.training import get_model
from megatron.text_generation import generate_and_post_process
from megatron.arguments import core_transformer_config_from_args
from megatron.inference.text_generation import generate_and_post_process
from megatron.training.arguments import core_transformer_config_from_args
from megatron.core.models.gpt import GPTModel
from typing import Union
import megatron.model
import megatron.legacy.model
from megatron.core.transformer.spec_utils import import_module
from megatron.arguments import core_transformer_config_from_args
from megatron.training.arguments import core_transformer_config_from_args
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec, get_gpt_layer_local_spec

def model_provider(pre_process=True, post_process=True) -> Union[GPTModel, megatron.model.GPTModel]:
def model_provider(pre_process=True, post_process=True) -> Union[GPTModel, megatron.legacy.model.GPTModel]:
"""Builds the model.
If `use_mcore_models` is set to True, it returns the mcore GPT model; otherwise it returns the legacy GPT model.
@@ -37,7 +37,7 @@ def model_provider(pre_process=True, post_process=True) -> Union[GPTModel, megat
Returns:
Union[GPTModel, megatron.model.GPTModel]: The returned model
Union[GPTModel, megatron.legacy.model.GPTModel]: The returned model
"""
args = get_args()

@@ -83,7 +83,7 @@ def model_provider(pre_process=True, post_process=True) -> Union[GPTModel, megat
else:
assert(args.context_parallel_size == 1), "Context parallelism is only supported with Megatron Core!"

model = megatron.model.GPTModel(
model = megatron.legacy.model.GPTModel(
config,
num_tokentypes=0,
parallel_output=True,
6 changes: 3 additions & 3 deletions examples/deploy/README.md → examples/inference/README.md
@@ -42,7 +42,7 @@ following checkpoint formats with some remedy:

| GPTModel | sharded | remedy arguments |
|-----------------------------------|---------|-----------------------------------------|
| megatron.model | | `--ammo-load-classic-megatron-to-mcore` |
| megatron.legacy.model | | `--ammo-load-classic-megatron-to-mcore` |
| TE-Fused (default mcore gpt spec) | | `--ammo-convert-te-to-local-spec` |
| TE-Fused (default mcore gpt spec) | x | |

@@ -76,7 +76,7 @@ cd ..

Now launch the PTQ + TensorRT-LLM export script,
```
bash examples/deploy/ptq_trtllm_nemotron3_8b ./nemotron-3-8b-base-4k None
bash examples/inference/ptq_trtllm_nemotron3_8b ./nemotron-3-8b-base-4k None
```
By default, `cnn_dailymail` is used for calibration. The `GPTModel` will have quantizers for simulating the
quantization effect. The checkpoint will be saved optionally (with quantizers as additional states) and can
@@ -112,7 +112,7 @@ The script expects `${CHECKPOINT_DIR}` (`./nemotron-3-8b-base-4k`) to have the f
> that we support.
```sh
bash examples/deploy/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
bash examples/inference/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
```

The script expects `${CHECKPOINT_DIR}` to have the following structure:
@@ -73,7 +73,7 @@ python -c "import ammo.torch.quantization.extensions as ext; print(ext.cuda_ext)
launch_config="--nproc_per_node=${TP}"

# Launch multi-process with torchrun
torchrun ${launch_config} examples/deploy/text_generation_ptq.py ${options} ${additional_options} --load ${CHECKPOINT_LOAD_DIR}
torchrun ${launch_config} examples/inference/text_generation_ptq.py ${options} ${additional_options} --load ${CHECKPOINT_LOAD_DIR}

# This script is using mpi4py which will fork multiple processes.
python examples/deploy/trtllm_text_generation.py ${trtllm_options}
python examples/inference/trtllm_text_generation.py ${trtllm_options}
@@ -68,8 +68,8 @@ python -c "import ammo.torch.quantization.extensions as ext; print(ext.cuda_ext)
launch_config="--nproc_per_node=${TP}"

# Launch multi-process with torchrun
torchrun ${launch_config} examples/deploy/text_generation_ptq.py ${options} ${additional_options} --load ${CHECKPOINT_LOAD_DIR}
torchrun ${launch_config} examples/inference/text_generation_ptq.py ${options} ${additional_options} --load ${CHECKPOINT_LOAD_DIR}

# This script is using mpi4py which will fork multiple processes.
python examples/deploy/trtllm_text_generation.py ${trtllm_options}
python examples/inference/trtllm_text_generation.py ${trtllm_options}

@@ -13,16 +13,16 @@
from datasets import load_dataset

# [ModelOpt]: changing the default model provider to the AMMO version
from megatron import get_args, print_rank_0
from megatron.checkpointing import load_checkpoint, save_checkpoint
from megatron.training import get_args, print_rank_0
from megatron.training.checkpointing import load_checkpoint, save_checkpoint
from megatron.core import mpu
from megatron.core.dist_checkpointing import load
from megatron.deploy.arguments import add_ammo_args
from megatron.deploy.gpt.model_provider import model_provider
from megatron.initialize import initialize_megatron
from megatron.text_generation import generate_and_post_process
from megatron.inference.arguments import add_ammo_args
from megatron.inference.gpt.model_provider import model_provider
from megatron.training.initialize import initialize_megatron
from megatron.inference.text_generation import generate_and_post_process
from megatron.training import get_model
from megatron.utils import unwrap_model
from megatron.training.utils import unwrap_model

QUANT_CFG_CHOICES = {
"int8": atq.INT8_DEFAULT_CFG,
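For orientation, the calibration flow this script drives looks roughly like the sketch below (an assumed AMMO API: `atq.quantize` taking a model, a quantization config, and a calibration forward loop; the real script adds distributed setup and checkpoint saving):

```python
import ammo.torch.quantization as atq

def forward_loop(model):
    """Run calibration batches through the model so the inserted
    quantizers can collect activation ranges (cnn_dailymail by default)."""
    for batch in calibration_batches:  # assumed iterable of input batches
        model(batch)

# Assumed entry point: inserts quantizers into supported modules,
# calibrates them via the forward loop, and returns the quantized model.
model = atq.quantize(model, QUANT_CFG_CHOICES["int8"], forward_loop)
```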
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -4,10 +4,10 @@

from typing import Union

from megatron import get_args, print_rank_0
from megatron.arguments import core_transformer_config_from_args
from megatron.core.deploy.gpt.model_specs import get_gpt_layer_ammo_spec
from megatron.core.deploy.gpt.state_dict_hooks import (
from megatron.training import get_args, print_rank_0
from megatron.training.arguments import core_transformer_config_from_args
from megatron.core.inference.gpt.model_specs import get_gpt_layer_ammo_spec
from megatron.core.inference.gpt.state_dict_hooks import (
mcore_gpt_load_classic_state_dict_pre_hook,
mcore_gpt_load_te_state_dict_pre_hook,
)
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -6,7 +6,7 @@

import torch

from megatron import get_args
from megatron.training import get_args
from megatron.core import mpu, InferenceParams
from .communication import (
send_to_next_pipeline_rank,
@@ -5,9 +5,9 @@
import torch
import torch.nn.functional as F

from megatron import get_args, get_tokenizer
from megatron.training import get_args, get_tokenizer
from megatron.core import mpu
from megatron.utils import get_ltor_masks_and_position_ids
from megatron.training.utils import get_ltor_masks_and_position_ids
from .communication import (
copy_from_last_to_first_pipeline_stage,
broadcast_from_last_pipeline_stage,
File renamed without changes.
@@ -6,7 +6,7 @@
import torch


from megatron import get_tokenizer, get_args
from megatron.training import get_tokenizer, get_args
from .communication import broadcast_int_list, broadcast_tensor


@@ -5,9 +5,9 @@
import threading
from flask import Flask, request, jsonify, current_app
from flask_restful import Resource, Api
from megatron import get_args
from megatron.text_generation import generate_and_post_process
from megatron.text_generation import beam_search_and_post_process
from megatron.training import get_args
from megatron.inference.text_generation import generate_and_post_process
from megatron.inference.text_generation import beam_search_and_post_process


GENERATE_NUM = 0
File renamed without changes.
File renamed without changes.
@@ -4,11 +4,11 @@
import numpy as np
import torch

from megatron import get_args, get_tokenizer, print_rank_0
from megatron.training import get_args, get_tokenizer, print_rank_0
from megatron.core import mpu, tensor_parallel
from megatron.data.dataset_utils import create_masked_lm_predictions, \
from megatron.legacy.data.dataset_utils import create_masked_lm_predictions, \
pad_and_convert_to_numpy
from megatron.data.data_samplers import MegatronPretrainingSampler
from megatron.legacy.data.data_samplers import MegatronPretrainingSampler

def make_attention_mask(source_block, target_block):
"""
@@ -7,7 +7,7 @@
import torch
import numpy as np
from torch.utils.data import Dataset
from megatron import get_args
from megatron.training import get_args
from megatron.core import mpu


@@ -26,7 +26,7 @@
import numpy as np
import torch

from megatron import (
from megatron.training import (
get_args,
print_rank_0
)
@@ -535,8 +535,8 @@ def build_dataset(name, data_prefix, max_num_samples,
max_seq_length_dec, dataset_type='standard_bert',
indexed_dataset=None):

from megatron.data.ict_dataset import ICTDataset
from megatron.data.multimodal_dataset import MultiModalDataset
from megatron.legacy.data.ict_dataset import ICTDataset
from megatron.legacy.data.multimodal_dataset import MultiModalDataset

if dataset_type == DSET_TYPE_BERT or dataset_type == DSET_TYPE_T5:
raise ValueError("The Megatron-LM BERT and T5 datasets are deprecated.")
@@ -4,10 +4,10 @@
import numpy as np
from torch.utils.data import Dataset

from megatron import get_tokenizer
from megatron import get_args
from megatron.data.dataset_utils import get_indexed_dataset_
from megatron.data.realm_dataset_utils import get_block_samples_mapping
from megatron.training import get_tokenizer
from megatron.training import get_args
from megatron.legacy.data.dataset_utils import get_indexed_dataset_
from megatron.legacy.data.realm_dataset_utils import get_block_samples_mapping

def make_attention_mask(source_block, target_block):
"""
File renamed without changes.
File renamed without changes.
@@ -9,9 +9,9 @@
import torch
from torch.utils.data import Dataset

from megatron import print_rank_0, get_args, get_tokenizer
from megatron.training import print_rank_0, get_args, get_tokenizer
from megatron.core import tensor_parallel
from megatron.data.biencoder_dataset_utils import make_attention_mask
from megatron.legacy.data.biencoder_dataset_utils import make_attention_mask

def get_open_retrieval_wiki_dataset():
args = get_args()
@@ -4,10 +4,10 @@
import numpy as np
import torch

from megatron import print_rank_0
from megatron.training import print_rank_0
from megatron.core import mpu, tensor_parallel
from megatron.data.dataset_utils import create_masked_lm_predictions, pad_and_convert_to_numpy
from megatron import get_args, get_tokenizer, print_rank_0
from megatron.legacy.data.dataset_utils import create_masked_lm_predictions, pad_and_convert_to_numpy
from megatron.training import get_args, get_tokenizer, print_rank_0


def get_one_epoch_dataloader(dataset, micro_batch_size=None):
@@ -24,7 +24,7 @@ def get_one_epoch_dataloader(dataset, micro_batch_size=None):
sampler = torch.utils.data.SequentialSampler(dataset)
# importantly, drop_last must be False to get all the data.
assert False, 'DistributedBatchSampler deprecated, change the implementation'
from megatron.data.samplers import DistributedBatchSampler
from megatron.legacy.data.samplers import DistributedBatchSampler
batch_sampler = DistributedBatchSampler(sampler,
batch_size=global_batch_size,
drop_last=False,
@@ -6,7 +6,7 @@
import numpy as np
import torch

from megatron import get_args
from megatron.training import get_args
from megatron.core import mpu


@@ -5,10 +5,10 @@
import torch
import torchvision.transforms as T
from torchvision import datasets
from megatron import get_args
from megatron.data.image_folder import ImageFolder
from megatron.data.autoaugment import ImageNetPolicy
from megatron.data.data_samplers import RandomSeedDataset
from megatron.training import get_args
from megatron.legacy.data.image_folder import ImageFolder
from megatron.legacy.data.autoaugment import ImageNetPolicy
from megatron.legacy.data.data_samplers import RandomSeedDataset
from PIL import Image, ImageFilter, ImageOps


File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -3,11 +3,11 @@
import torch
from torch.nn import LayerNorm

from megatron.model.enums import AttnMaskType
from megatron.model.fused_layer_norm import MixedFusedLayerNorm
from megatron.model.fused_softmax import FusedScaleMaskSoftmax
from megatron.model.utils import attention_mask_func
from megatron.fused_kernels import load
from megatron.legacy.model.enums import AttnMaskType
from megatron.legacy.model.fused_layer_norm import MixedFusedLayerNorm
from megatron.legacy.model.fused_softmax import FusedScaleMaskSoftmax
from megatron.legacy.model.utils import attention_mask_func
from megatron.legacy.fused_kernels import load

def test_load_fused_kernels():
try:
File renamed without changes.