Mismatch about the evaluation results #118

Closed
yuzc19 opened this issue Sep 21, 2023 · 11 comments

@yuzc19 commented Sep 21, 2023

Hi,

Thanks for your great work!

I am trying to reproduce the results in pythia-160m-zero-shot using the lit-gpt repo. The results for the step13000 checkpoint match yours well, but the results for the step143000 checkpoint fall behind the numbers in your repo by large margins. For example, it only achieves 0.121 accuracy on Lambada (OpenAI) in my test, while your repo reports 0.328. Why is that?

From the paper, I believe 143000 steps correspond to one epoch (and to the main branch on the Hugging Face Hub). Am I correct?

Thank you!

@uSaiPrashanth (Member)

Could you try evaluating the model with lm-eval-harness and share the resulting JSON here?

These are the result JSONs, uploaded directly from this script, and I am confident in the uploaded results.
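For reference, a minimal sketch of such a run, assuming the v0.3-era harness API (where simple_evaluate lives in lm_eval.evaluator and the Hugging Face model type is "hf-causal"); the output file name is illustrative:

import json

from lm_eval import evaluator

# Zero-shot evaluation of the step143000 revision on a few of the tasks discussed here
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step143000",
    tasks=["lambada_openai", "piqa", "arc_easy", "arc_challenge"],
    num_fewshot=0,
)

# Dump the full results dict so it can be shared as-is
with open("pythia-160m_step143000.json", "w") as f:
    json.dump(results, f, indent=2, default=str)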

@yuzc19 (Author) commented Sep 21, 2023

Yeah, I am evaluating the model with lm-eval-harness, and this is my output:

"results": {
"piqa": {
"acc": 0.5903155603917302,
"acc_stderr": 0.0,
"acc_norm": 0.5903155603917302,
"acc_norm_stderr": 0.0
},
"winogrande": {
"acc": 0.4956590370955012,
"acc_stderr": 0.0
},
"arc_easy": {
"acc": 0.37457912457912457,
"acc_stderr": 0.0,
"acc_norm": 0.3686868686868687,
"acc_norm_stderr": 0.0
},
"arc_challenge": {
"acc": 0.19965870307167236,
"acc_stderr": 0.0,
"acc_norm": 0.23890784982935154,
"acc_norm_stderr": 0.0
},
"logiqa": {
"acc": 0.19201228878648233,
"acc_stderr": 0.0,
"acc_norm": 0.22119815668202766,
"acc_norm_stderr": 0.0
},
"lambada_openai": {
"ppl": 106.66464362385898,
"ppl_stderr": 0.0,
"acc": 0.12128856976518533,
"acc_stderr": 0.0
},
"wsc": {
"acc": 0.5865384615384616,
"acc_stderr": 0.0
},
"sciq": {
"acc": 0.555,
"acc_stderr": 0.0,
"acc_norm": 0.625,
"acc_norm_stderr": 0.0
}
},

@uSaiPrashanth (Member)

Could you share the full JSON file?

@yuzc19 (Author) commented Sep 22, 2023

Here is the configuration section that appears below the results:

"versions": {
"arc_challenge": 0,
"logiqa": 0,
"wsc": 0,
"sciq": 0,
"piqa": 0,
"lambada_openai": 0,
"arc_easy": 0,
"winogrande": 0
},
"config": {
"model": "neox",
"model_args": {
"distributed_backend": "nccl",
"local_rank": 0,
"rank": 0,
"lazy_mpu_init": false,
"short_seq_prob": 0.1,
"eod_mask_loss": false,
"adlr_autoresume": false,
"adlr_autoresume_interval": 1000,
"seed": 1234,
"onnx_safe": false,
"deepscale": false,
"deepscale_config": null,
"deepspeed_mpi": false,
"deepspeed_slurm": false,
"user_script": "evaluate.py",
"iteration": 0,
"do_train": null,
"do_valid": null,
"do_test": null,
"save_iters": [
0,
1,
2,
4,
8,
16,
32,
64,
128,
256,
512,
1000,
2000,
3000,
4000,
5000,
6000,
7000,
8000,
9000,
10000,
11000,
12000,
13000,
14000,
15000,
16000,
17000,
18000,
19000,
20000,
21000,
22000,
23000,
24000,
25000,
26000,
27000,
28000,
29000,
30000,
31000,
32000,
33000,
34000,
35000,
36000,
37000,
38000,
39000,
40000,
41000,
42000,
43000,
44000,
45000,
46000,
47000,
48000,
49000,
50000,
51000,
52000,
53000,
54000,
55000,
56000,
57000,
58000,
59000,
60000,
61000,
62000,
63000,
64000,
65000,
66000,
67000,
68000,
69000,
70000,
71000,
72000,
73000,
74000,
75000,
76000,
77000,
78000,
79000,
80000,
81000,
82000,
83000,
84000,
85000,
86000,
87000,
88000,
89000,
90000,
91000,
92000,
93000,
94000,
95000,
96000,
97000,
98000,
99000,
100000,
101000,
102000,
103000,
104000,
105000,
106000,
107000,
108000,
109000,
110000,
111000,
112000,
113000,
114000,
115000,
116000,
117000,
118000,
119000,
120000,
121000,
122000,
123000,
124000,
125000,
126000,
127000,
128000,
129000,
130000,
131000,
132000,
133000,
134000,
135000,
136000,
137000,
138000,
139000,
140000,
141000,
142000
],
"global_num_gpus": 8,
"text_gen_type": "unconditional",
"temperature": 0.0,
"top_p": 0.0,
"top_k": 0,
"return_logits": false,
"maximum_tokens": 64,
"prompt_end": "\n",
"sample_input_file": null,
"sample_output_file": "samples.txt",
"num_samples": 1,
"recompute": false,
"eval_results_prefix": "",
"eval_tasks": [
"lambada_openai",
"piqa",
"winogrande",
"wsc",
"arc_easy",
"arc_challenge",
"sciq",
"logiqa"
],
"use_wandb": false,
"wandb_group": null,
"wandb_team": null,
"wandb_project": "neox",
"wandb_host": "https://api.wandb.ai",
"wandb_init_all_ranks": false,
"git_hash": "444c0ef",
"log_dir": "logs",
"tensorboard_dir": "tensorboard",
"log_interval": 10,
"log_grad_pct_zeros": false,
"log_param_norm": false,
"log_grad_norm": false,
"log_optimizer_states": false,
"log_gradient_noise_scale": false,
"gradient_noise_scale_n_batches": 5,
"gradient_noise_scale_cpu_offload": false,
"pipe_parallel_size": 1,
"model_parallel_size": 1,
"pipe_partition_method": "type:transformer|mlp",
"world_size": 8,
"is_pipe_parallel": true,
"data_path": "data/enwik8/enwik8_text_document",
"use_shared_fs": true,
"train_data_paths": null,
"label_data_paths": null,
"test_data_paths": null,
"valid_data_paths": null,
"train_data_weights": null,
"valid_data_weights": null,
"test_data_weights": null,
"weight_by_num_documents": false,
"weighted_sampler_alpha": 0.3,
"data_impl": "mmap",
"mmap_warmup": false,
"save": "checkpoints",
"config_files": {
"slurm_local.yml": "{\n "data_path": "data/enwik8/enwik8_text_document",\n "vocab_file": "../Lightning-Pretrain/checkpoints/pythia/tokenizer.json",\n # "merge_file": "data/gpt2-merges.txt",\n "load": "checkpoints/neox_converted/pythia/160m",\n "save": "checkpoints",\n "checkpoint_validation_with_forward_pass": false,\n "tensorboard_dir": "tensorboard",\n "log_dir": "logs",\n "use_wandb": false,\n "wandb_host": "https://api.wandb.ai\",\n "wandb_project": "neox"\n}\n",
"160M.yml": "{\n "pipe_parallel_size": 1,\n "model_parallel_size": 1,\n\n "num_layers": 12,\n "hidden_size": 768,\n "num_attention_heads": 12,\n "seq_length": 2048,\n "max_position_embeddings": 2048,\n "pos_emb": "rotary",\n "rotary_pct": 0.25,\n "no_weight_tying": true,\n "gpt_j_residual": true,\n "output_layer_parallelism": "column",\n\n "attention_config": [[["flash"], 12]],\n\n "scaled_upper_triang_masked_softmax_fusion": true,\n "bias_gelu_fusion": true,\n\n "init_method": "small_init",\n "output_layer_init_method": "wang_init",\n\n "optimizer": {\n "type": "Adam",\n "params": {\n "lr": 0.0006,\n "betas": [0.9, 0.95],\n "eps": 1.0e-8\n }\n },\n "min_lr": 0.00006,\n\n # "zero_optimization": {\n # "stage": 1,\n # "allgather_partitions": true,\n # "allgather_bucket_size": 500000000,\n # "overlap_comm": true,\n # "reduce_scatter": true,\n # "reduce_bucket_size": 500000000,\n # "contiguous_gradients": true,\n # "cpu_offload": false\n # },\n\n "train_micro_batch_size_per_gpu": 32,\n "gas": 1,\n "data_impl": "mmap",\n "num_workers": 1,\n\n "checkpoint_activations": true,\n "checkpoint_num_layers": 1,\n "partition_activations": true,\n "synchronize_each_layer": true,\n\n "gradient_clipping": 1.0,\n "weight_decay": 0.1,\n "hidden_dropout": 0,\n "attention_dropout": 0,\n\n "fp16": {\n "fp16": true,\n "enabled": true,\n "loss_scale": 0,\n "loss_scale_window": 1000,\n "initial_scale_power": 12,\n "hysteresis": 2,\n "min_loss_scale": 1\n },\n\n "train_iters": 143000,\n "lr_decay_iters": 143000,\n "distributed_backend": "nccl",\n "lr_decay_style": "cosine",\n "warmup": 0.01,\n "checkpoint_factor": 1000,\n "extra_save_iters": [0,1,2,4,8,16,32,64,128,256,512],\n "eval_interval": 143000,\n "eval_iters": 10,\n\n "log_interval": 10,\n "steps_per_print": 10,\n "wall_clock_breakdown": true,\n\n "tokenizer_type": "HFTokenizer"\n}\n"
},
"load": "checkpoints/neox_converted/pythia/160m",
"checkpoint_validation_with_forward_pass": false,
"checkpoint_scale": "linear",
"checkpoint_factor": 1000,
"extra_save_iters": [
0,
1,
2,
4,
8,
16,
32,
64,
128,
256,
512
],
"no_save_optim": false,
"no_save_rng": false,
"no_load_optim": true,
"no_load_rng": false,
"finetune": false,
"batch_size": 32,
"train_iters": 143000,
"eval_iters": 10,
"keep_last_n_checkpoints": null,
"eval_interval": 143000,
"split": "969, 30, 1",
"vocab_file": "../Lightning-Pretrain/checkpoints/pythia/tokenizer.json",
"merge_file": null,
"num_workers": 1,
"exit_interval": null,
"attention_dropout": 0,
"hidden_dropout": 0,
"weight_decay": 0.1,
"checkpoint_activations": false,
"checkpoint_num_layers": 1,
"deepspeed_activation_checkpointing": true,
"contiguous_checkpointing": false,
"checkpoint_in_cpu": false,
"synchronize_each_layer": true,
"profile_backward": false,
"partition_activations": false,
"gas": 1,
"clip_grad": 1.0,
"hysteresis": 2,
"dynamic_loss_scale": true,
"loss_scale": null,
"loss_scale_window": 1000.0,
"min_scale": 1.0,
"char_level_ppl": false,
"use_mup": false,
"coord_check": false,
"save_base_shapes": false,
"base_shapes_file": null,
"mup_init_scale": 1.0,
"mup_attn_temp": 1.0,
"mup_output_temp": 1.0,
"mup_embedding_mult": 1.0,
"mup_rp_embedding_mult": 1.0,
"mup_width_scale": 2,
"tokenizer_type": "HFTokenizer",
"padded_vocab_size": 50304,
"optimizer_type": "Adam",
"use_bnb_optimizer": false,
"zero_stage": 0,
"zero_reduce_scatter": true,
"zero_contiguous_gradients": false,
"zero_reduce_bucket_size": 500000000,
"zero_allgather_bucket_size": 500000000,
"lr": 0.001,
"lr_decay_style": "cosine",
"lr_decay_iters": 143000,
"min_lr": 6e-05,
"warmup": 0.01,
"override_lr_scheduler": false,
"use_checkpoint_lr_scheduler": false,
"precision": "fp16",
"num_layers": 12,
"hidden_size": 768,
"num_attention_heads": 12,
"seq_length": 2048,
"max_position_embeddings": 2048,
"norm": "layernorm",
"layernorm_epsilon": 1e-05,
"rms_norm_epsilon": 1e-08,
"scalenorm_epsilon": 1e-08,
"pos_emb": "rotary",
"rpe_num_buckets": 32,
"rpe_max_distance": 128,
"opt_pos_emb_offset": 0,
"no_weight_tying": true,
"attention_config": [
"flash",
"flash",
"flash",
"flash",
"flash",
"flash",
"flash",
"flash",
"flash",
"flash",
"flash",
"flash"
],
"sparsity_config": {},
"num_unique_layers": null,
"param_sharing_style": "grouped",
"make_vocab_size_divisible_by": 128,
"activation": "gelu",
"scaled_upper_triang_masked_softmax_fusion": true,
"scaled_masked_softmax_fusion": false,
"bias_gelu_fusion": true,
"bias_dropout_fusion": false,
"fp16_lm_cross_entropy": false,
"init_method_std": 0.02,
"apply_query_key_layer_scaling": false,
"use_cpu_initialization": false,
"attention_softmax_in_fp32": false,
"rotary_pct": 0.25,
"rotary_emb_base": 10000,
"init_method": "small_init",
"output_layer_init_method": "wang_init",
"gmlp_attn_dim": 64,
"gpt_j_residual": true,
"gpt_j_tied": false,
"use_bias_in_norms": true,
"use_bias_in_attn_linear": true,
"mlp_type": "regular",
"soft_prompt_tuning": null,
"output_layer_parallelism": "column",
"deepspeed": true,
"train_batch_size": 256,
"train_micro_batch_size_per_gpu": 32,
"gradient_accumulation_steps": 1,
"optimizer": null,
"scheduler": null,
"fp32_allreduce": false,
"prescale_gradients": false,
"gradient_predivide_factor": 1.0,
"sparse_gradients": false,
"fp16": {
"fp16": true,
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": null,
"amp": null,
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": 0,
"allgather_partitions": true,
"reduce_scatter": true,
"allgather_bucket_size": 500000000,
"overlap_comm": false,
"reduce_bucket_size": 500000000,
"contiguous_gradients": false
},
"curriculum_learning": null,
"curriculum_seqlen": 0,
"steps_per_print": 10,
"wall_clock_breakdown": true,
"dump_state": false,
"flops_profiler": null,
"communication_data_type": null,
"autotuning": null,
"activation_checkpointing": null,
"sparse_attention": null,
"data_efficiency": null,
"tensorboard": null,
"wandb": null,
"csv_monitor": null,
"elasticity": null,
"comms_logger": null,
"compression_training": null,
"checkpoint": null,
"data_types": null,
"deepspeed_extra_args": null,
"hostfile": null,
"include": null,
"exclude": null,
"num_nodes": -1,
"num_gpus": null,
"master_port": 29500,
"master_addr": null,
"launcher": "pdsh",
"force_multi": false,
"detect_nvlink_pairs": false,
"autotuning_run": null,
"no_ssh_check": false,
"comment": null
},
"num_fewshot": 0,
"batch_size": 256,
"device": "cuda:0",
"no_cache": true,
"limit": null,
"bootstrap_iters": 10000,
"description_dict": null
}

@yuzc19 (Author) commented Sep 22, 2023

I should add that I tested checkpoint-100000 (which works very well, with a Lambada accuracy of 0.308) and checkpoint-133000 (which works rather badly, with a Lambada accuracy of 0.112). I have no idea why there would be such a performance drop within 33000 steps.

@yuzc19 (Author) commented Sep 22, 2023

I wonder if the wrong checkpoint was uploaded for Pythia-160M at step143000 (which is also the main branch).

@uSaiPrashanth (Member)

I just evaluated step143000 again and here are the results:

{
  "results": {
    "arc_challenge": {
      "acc": 0.1825938566552901,
      "acc_stderr": 0.011289730684565,
      "acc_norm": 0.2354948805460751,
      "acc_norm_stderr": 0.012399451855004746
    },
    "arc_easy": {
      "acc": 0.43602693602693604,
      "acc_stderr": 0.010175459582759727,
      "acc_norm": 0.3977272727272727,
      "acc_norm_stderr": 0.010042861602178066
    },
    "piqa": {
      "acc": 0.6251360174102285,
      "acc_stderr": 0.011294565805619017,
      "acc_norm": 0.6175190424374319,
      "acc_norm_stderr": 0.011339019654272349
    }
  },
  "versions": {
    "arc_challenge": 0,
    "arc_easy": 0,
    "piqa": 0
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=EleutherAI/pythia-160m,revision=step143000",
    "num_fewshot": 0,
    "batch_size": null,
    "batch_sizes": [],
    "device": null,
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}

hf-causal (pretrained=EleutherAI/pythia-160m,revision=step143000), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task          | Version | Metric   | Value  | Stderr   |
|---------------|---------|----------|--------|----------|
| arc_challenge | 0       | acc      | 0.1826 | ± 0.0113 |
|               |         | acc_norm | 0.2355 | ± 0.0124 |
| arc_easy      | 0       | acc      | 0.4360 | ± 0.0102 |
|               |         | acc_norm | 0.3977 | ± 0.0100 |
| piqa          | 0       | acc      | 0.6251 | ± 0.0113 |
|               |         | acc_norm | 0.6175 | ± 0.0113 |

I would suggest that you:

  • Try evaluating the uploaded Hugging Face model, to make sure your results match mine.
  • Try reconverting the checkpoint back to NeoX; a sanity-check sketch follows this list.
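For the reconversion check, one quick sanity test is to diff the converted checkpoint against the uploaded revision parameter by parameter. A minimal sketch, assuming the converted model loads with transformers; the local path is a hypothetical placeholder:

import torch
from transformers import AutoModelForCausalLM

ref = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m", revision="step143000"
)
mine = AutoModelForCausalLM.from_pretrained(
    "path/to/converted/checkpoint"  # hypothetical: your converted copy
)

# A large max abs diff on any parameter points at the conversion step
# rather than at the uploaded weights.
with torch.no_grad():
    for (name, p_ref), (_, p_mine) in zip(
        ref.named_parameters(), mine.named_parameters()
    ):
        diff = (p_ref - p_mine).abs().max().item()
        if diff > 1e-5:
            print(f"{name}: max abs diff {diff:.3e}")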

@yuzc19 (Author) commented Sep 25, 2023

Yeah, the results of the HF model match well, but the NeoX-converted checkpoint falls behind. How could that be?

BTW, what are the Lambada (OpenAI) results here?

@uSaiPrashanth (Member)

As mentioned and verified above, there are no issues with the evaluation results in this repo, so the Lambada (OpenAI) evals remain the same as the ones in the corresponding files.

@inf3rnus commented Jan 19, 2024

FWIW, I'm running into a similar issue to @yuzc19's: Lambada (lambada_openai) shows me similarly low results for EleutherAI/pythia-160m.

Note that I'm just trying to replicate the results in the paper, because the default HF model does not line up at all.

e.g.

from lm_eval import simple_evaluate  # top-level export in lm-eval-harness v0.4+

result = simple_evaluate(
    model="hf-auto",
    model_args=",".join(
        [
            "pretrained=EleutherAI/pythia-160m",
            "revision=step100000",
            # "batch_size=16",
            "parallelize=True",
        ]
    ),
    tasks=["lambada_openai"],
)

gives me this

[screenshot: lambada_openai results for revision step100000]

whereas

result = simple_evaluate(
    model="hf-auto",
    model_args=",".join(
        [
            "pretrained=EleutherAI/pythia-160m",
            "revision=step143000",
            # "batch_size=16",
            "parallelize=True",
        ]
    ),
    tasks=["lambada_openai"],
)

gives me this

[screenshot: lambada_openai results for revision step143000]

@yuzc19 (Author) commented Jan 20, 2024

Thank you, @inf3rnus. This is exactly what I encountered before, and I feel like the model experiences a sort of catastrophic forgetting in the steps after 100000. I haven't figured out why, but Lambada is definitely not a very stable evaluation task.
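One way to localize the drop would be to sweep the public step<N> revisions with the same harness call and watch where lambada_openai accuracy falls off. A minimal sketch, assuming the v0.4 API used above (v0.4 keys metrics as "acc,none", older versions as plain "acc", so both are handled):

from lm_eval import simple_evaluate

# Hypothetical sweep over a few public Pythia revisions; branch names follow
# the step<N> convention on the HF Hub.
for step in (100000, 110000, 120000, 130000, 143000):
    result = simple_evaluate(
        model="hf-auto",
        model_args=f"pretrained=EleutherAI/pythia-160m,revision=step{step}",
        tasks=["lambada_openai"],
    )
    metrics = result["results"]["lambada_openai"]
    acc = metrics.get("acc,none", metrics.get("acc"))  # key differs by harness version
    print(f"step{step}: lambada_openai acc = {acc}")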
