In this repo, we provide code for the project "Investigating Masked Models in Representation Learning and Anti-Forgetting". Below is guidance on installation, running the experiments, and some results.
This project builds on MMPretrain; please refer to MMPretrain for more details. The following is copied from their documentation:
Below are quick steps for installation:
conda create -n open-mmlab python=3.8 pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -y
conda activate open-mmlab
pip install openmim
git clone https://github.com/open-mmlab/mmpretrain.git
cd mmpretrain
mim install -e .
Please refer to the installation documentation for more detailed installation and dataset preparation instructions.
For multi-modality model support, please install the extra dependencies with:
mim install -e ".[multimodal]"
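To verify the installation, a quick sanity check (this follows the usual MMPretrain convention; it simply imports the package and prints its version):
python -c "import mmpretrain; print(mmpretrain.__version__)"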
For wandb logging, please install and log in to wandb:
pip install wandb
wandb login
and uncomment the following line in the corresponding config file in cifar-img/dl_res18_exp:
# vis_backends = dict(type='WandbVisBackend', xxxxxxxxx
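For reference, an enabled version typically looks like this in an MMPretrain config (the project name below is a placeholder; check the commented line in the actual config for the exact fields):
vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='WandbVisBackend',
         init_kwargs=dict(project='masked-model')),  # placeholder project name
]
visualizer = dict(type='UniversalVisualizer', vis_backends=vis_backends)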
All experiments can be found in the tools/res_trails directory.
All related configurations are in cifar-img/dl_res18_exp.
The naive CE training experiments can be found in tools/cl_trails/baseline_CE.
Corresponding configs are in cifar-img/dl_res18_exp/baseline
bash /---your---own---working--dir--/tools/cl_trails/baseline_CE/run_baselinece.sh
The KD training experiments can be found in tools/cl_trails/baseline_KD.
Corresponding configs are in cifar-img/dl_res18_exp/baseline
# reference model: basic swap and freeze
bash /---your---own---working--dir--/tools/cl_trails/baseline_KD/run_kdbaseline.sh
# change the reference model using EMA
bash /---your---own---working--dir--/tools/cl_trails/baseline_KD/run_kdema_var.sh 0.1 # ema_ratio
# use ${ema_ratios[$SLURM_ARRAY_TASK_ID]} for a Slurm array job
# for further classifier-only experiments (clso: we use only the classifier head as the reference model), just change the config: model.train_cfg.cls_only=True
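If you launch training directly with MMPretrain's standard entry point instead of the provided scripts, the same override can be passed on the command line (this assumes the usual tools/train.py from MMPretrain; the config path is a placeholder):
python tools/train.py /---your---own---config---file---.py --cfg-options model.train_cfg.cls_only=True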
The masked model in this work is illustrated in the following figure, where the backbone and classifier (task) head are denoted by black trapezoids and a black rectangle represents the hidden/output feature. The gray arrow denotes the baseline CE pipeline, and the light purple arrow shows the KD/Masked reference pipeline. The light purple rectangle denotes the Masked Model method, where we apply a mask to the hidden feature. Since the classifier head requires the input channel count to stay unchanged, we append mask tokens in place of the masked part, as shown in the light purple dashed rectangle on the left. Inside this rectangle, the gray part represents the appended mask tokens and the white part is the unmasked part. On the right, we have two inverse-masked representations. Inspired by diffusion models, where the model predicts the noise, we also feed the inverse-masked feature to the model, hoping the model recovers better from the masked representation (the inverse forward pass has a detached gradient). In addition to the base KD/Mask model, we further introduce the cls_only variant of the KD/Mask model, which uses the same masking strategy and is illustrated with a light blue arrow. (A code sketch of the masked pipeline follows the legend below.)
- Black trapezoid: backbone and classifier (task) head
- Black rectangle: hidden/output feature
- Gray arrow: baseline CE pipeline
- Light purple arrow: KD/Masked reference pipeline
- Light purple rectangle: Masked Model method
- Gray part inside the light purple rectangle: appended mask token
- White part inside the light purple rectangle: unmasked part
- Light blue arrow: cls_only variant of the KD/Mask model
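As a rough sketch of the masked pipeline above (illustrative PyTorch only; the function and argument names are assumptions, not the repo's actual code):
import torch

def masked_views(feat, mask_token, mask_ratio=0.9):
    # feat: (B, C) hidden feature from the backbone;
    # mask_token: learnable token broadcast over the masked channels.
    C = feat.shape[1]
    mask = torch.rand(C, device=feat.device) < mask_ratio  # True = masked channel
    masked = torch.where(mask, mask_token, feat)    # gray (tokens) + white (kept) parts
    inverse = torch.where(~mask, mask_token, feat)  # inverse-masked representation
    return masked, inverse.detach()                 # the inverse forward is detached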
The Masked model training experiments can be found in tools/cl_trails/mask
Corresponding configs are in cifar-img/dl_res18_exp/mask
bash /---your---own---working--dir--/tools/cl_trails/mask/run_mskall_var.sh 0.9 'v1' 'sgf' #mask_ratio #inv_mode #mask_mode
# inv_mode
# selects how the inverse-masked logits are combined with the other outputs:
## v1: we add the inv_mask output logits to the masked output
## v2: we subtract the inv_mask output logits from the target output
### p.s. target output = the lower rectangle logits in the figure;
### masked output = the upper rectangle logits
# mask_mode
# selects how the mask tokens appended to the input channels are constructed:
## s: a single 1x1 torch tensor
## a: a 1 x (channels - channels*mask_ratio) torch tensor
## sgf: single tensor with gradient=False
## agf: all-tensor with gradient=False
### p.s. the "all" tensor is expanded to the same batch size as the hidden feature
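A minimal sketch of how these mask_mode options might construct the token (illustrative only; the size follows the description above, and treating sgf/agf as the gradient-frozen counterparts of s/a is an assumption):
import torch
import torch.nn as nn

def make_mask_token(mask_mode, channels, mask_ratio):
    # Token size quoted above: 1 x (channels - channels*mask_ratio).
    n_append = int(channels - channels * mask_ratio)
    if mask_mode in ('s', 'sgf'):
        token = torch.zeros(1, 1)         # single 1x1 token, broadcast over channels
    else:  # 'a' / 'agf'
        token = torch.zeros(1, n_append)  # one value per appended channel
    # The "all" variants are later expanded to the batch size of the hidden
    # feature; sgf/agf keep the token frozen (gradient=False).
    return nn.Parameter(token, requires_grad=mask_mode in ('s', 'a'))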
bash /---your---own---working--dir--/tools/cl_trails/mask/run_mskclso_var.sh 0.9 'v1' 'sgf' #mask_ratio #inv_mode #mask_mode
bash /---your---own---working--dir--/tools/cl_trails/mask/run_mskclso_var.sh 0.9 'v1' 'sgf' 0.1 'ema' #mask_ratio #inv_mode #mask_mode #ema_ratio #ema_mode
# ema_mode
# selects the averaging strategy for the EMA reference model:
## 'ema': ExponentialMovingAverage
## 'anne': MomentumAnnealingEMA
## 'sto': StochasticWeightAverage
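These names match MMEngine's model-averaging classes; a sketch of how the reference model could be wrapped (the ema_ratio-to-momentum mapping and the wiring are assumptions, not the repo's actual code):
from mmengine.model import (ExponentialMovingAverage, MomentumAnnealingEMA,
                            StochasticWeightAverage)

EMA_CLASSES = {
    'ema': ExponentialMovingAverage,
    'anne': MomentumAnnealingEMA,
    'sto': StochasticWeightAverage,
}

def build_reference_model(model, ema_mode='ema', ema_ratio=0.1):
    cls = EMA_CLASSES[ema_mode]
    if ema_mode == 'sto':
        return cls(model)                  # SWA takes no momentum argument
    return cls(model, momentum=ema_ratio)  # assumed mapping: ema_ratio -> momentum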
All baseline and Self-KD hyperparameter results tables can be found in tools/cl_trails/test_cl.sh and tools/cl_trails/baseline_KD/run_SD.md
For RankMe details, see https://arxiv.org/abs/2210.02885; it can be understood as a feature-learning quality metric, essentially a smooth measure of the rank of the representation (a computation sketch follows the list below). From the plot we can clearly see:
- So far, the KD-EMA results have slightly better accuracy/top-1 and larger RankMe values (not fully shown/explained in the note) [purple vs. blue].
- The naive training [blue] has a tendency of constantly dropping rank, which may be due to the influence of the classifier head (a large-to-small compression); this does not show up in SSL methods, whose reconstruction heads go small-to-large (or are roughly equivalent in width).
- KD/Mask can considerably increase the rank of the learned features, but for Mask, the accuracy does not surpass the baseline (maybe try a linear-probing accuracy?).
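For reference, RankMe is the exponential of the entropy of the normalized singular values of the feature matrix; a minimal sketch:
import torch

def rankme(feats, eps=1e-7):
    # feats: (N, D) matrix of embeddings from the frozen encoder.
    s = torch.linalg.svdvals(feats)          # singular values
    p = s / s.sum() + eps                    # normalized spectrum
    return torch.exp(-(p * p.log()).sum())   # exp of the spectral entropy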
Note: the HPC enforces storage limits, and wandb's artifacts kept filling them; while deleting the wandb cache, I accidentally ran rm -rf * on some of the wandb logs.
- The KD/Mask methods do have a smaller performance gap; is this a trade-off, or can it be alleviated by LP (given the good representation learnt) or some other method yet to be proposed?
- Is good representation quality reflected by RankMe?
- Is LP properly implemented? (See the sketch after this list.)
- What experiments should follow?
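For the LP question above, a minimal linear-probing sketch for comparison (freeze the backbone, train only a fresh linear head; all names here are illustrative assumptions):
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10, lr=0.1):
    for p in backbone.parameters():
        p.requires_grad_(False)   # freeze the representation
    backbone.eval()
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                z = backbone(x)   # frozen features
            loss = F.cross_entropy(head(z), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head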
Rank Log:
- KD-EMA LP and CIFAR10.1 results, Step 25 overview [most EMA ratios and loss weights]:
  - LP:
    - The LP results and the original classifier predictions are exactly the same on the CIFAR10.1 dataset.
    - However, the LP results on the CIFAR10 test set are sometimes slightly better than the original classifier's.
  - Rank:
    - As stated before, the loss weight on the auxiliary loss plays the main part (the distillation loss helps improve rank).
    - Higher rank (>= 59) tends to exhibit higher accuracy/top-1, though not quite as high as the ensemble-distillation results.
- KD-EMA LP and CIFAR10.1 results, Step 1 overview:
  - Compared with the LP results above, the difference is that this step-1-start version degrades slightly less from CIFAR10 to CIFAR10.1 (Step 1: 90 -> 82.55 vs. Step 25: 91 -> 81).
  - Rank exhibits a similar pattern; this time the floor is higher, and the rise in rank starts almost exactly at the method's start step.
- KD-Base LP and CIFAR10.1 results overview [most EMA ratios and loss weights]:
  - LP:
    - For KD-Base, the LP results and the original classifier-head predictions are all the same.
    - The margin between the original CIFAR10 test set and CIFAR10.1 is larger than under the KD-EMA mode.
  - The highest rank value here is almost the lowest rank value of the KD-EMA mode.
  - Testing a higher LP learning-rate multiplier is still pending.
- Test masked results performance on the CIFAR-10.1 dataset. ✅ ✅
- Finish EMA masked model results. ✅ ❓
  - EMA on the backbone
  - EMA based on the best LP multiplier
  - mskall baseline based on the best KD-EMA configuration
  - try the diffusion-style idea
- Test longer epochs for key hyperparameter selections. ❓ ❓
- Finish masked experiments in the continual learning setting. ❓ ❓