chatgpt-robust

Robustness evaluation of ChatGPT

This repo contains the source code for the paper On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective.

This project evaluates the robustness of ChatGPT and several foundation language models. You can find the results here.

Pre-requisites

First, clone the repo and change into the project directory:

git clone https://github.com/microsoft/robustlearn.git
cd robustlearn/chatgpt-robust

Then, install the following important dependencies:

You can also create a conda virtual environment by running conda env create -f environment.yml.

Usage

Everything is run through main.py:

For classification tasks:

  • Use Huggingface: python main.py --dataset advglue --task sst2 --service hug --model xxx
  • Use GPT API: python main.py --dataset advglue --task sst2 --service gpt --model text-davinci-003

For translation tasks:

  • Use Huggingface: python main.py --dataset advglue-t --task translation_en_to_zh --service hug --model xx
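To evaluate several tasks in one go, the commands above can be generated in a loop. The following is a minimal sketch, not part of the repo: the helper names `build_command` and `run_all` are hypothetical, and only the `sst2` task name is confirmed by the examples above (the other task names are guesses based on the results table). It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
import sys


def build_command(dataset, task, service, model):
    """Return the argument list for one main.py run, using the flags shown above."""
    return [
        sys.executable, "main.py",
        "--dataset", dataset,
        "--task", task,
        "--service", service,
        "--model", model,
    ]


def run_all(tasks, dataset="advglue", service="gpt",
            model="text-davinci-003", dry_run=True):
    """Build one command per task; print each, and run it unless dry_run is set."""
    commands = [build_command(dataset, t, service, model) for t in tasks]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Only "sst2" appears in the usage examples; "qqp" and "mnli" are assumed names.
    run_all(["sst2", "qqp", "mnli"])
```

Pass `dry_run=False` from inside `robustlearn/chatgpt-robust` to actually launch the runs.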

Results

Note that simply running the code will not give you the final results: the outputs of generative models are not stable, so some manual processing is needed. Bad cases for AdvGLUE and Flipkart are provided in this folder.
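Part of that processing can be semi-automated. The sketch below is one possible approach and not the procedure used in the paper: it maps a free-form output to a label only when exactly one label name appears in it, and sets everything else aside for manual review. The function names and the label strings are hypothetical.

```python
import re


def parse_label(output, labels):
    """Map a free-form model output to one of `labels`, or None if ambiguous."""
    text = output.lower()
    found = [
        lab for lab in labels
        if re.search(r"\b" + re.escape(lab.lower()) + r"\b", text)
    ]
    # Exactly one label mentioned: accept it. Zero or several: needs a human.
    return found[0] if len(found) == 1 else None


def split_outputs(outputs, labels):
    """Separate outputs into (index, label) pairs and (index, raw text) review cases."""
    parsed, needs_review = [], []
    for i, out in enumerate(outputs):
        label = parse_label(out, labels)
        if label is None:
            needs_review.append((i, out))
        else:
            parsed.append((i, label))
    return parsed, needs_review
```

A keyword match like this still misreads negations such as "not positive", so the accepted labels deserve a spot check as well.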

Here is the summary of the results.

Adversarial robustness for classification

The metric is attack success rate (ASR, in %); lower is better.

| Model | SST-2 | QQP | MNLI | QNLI | RTE | ANLI |
|---|---|---|---|---|---|---|
| Random | 50.0 | 50.0 | 66.7 | 50.0 | 50.0 | 66.7 |
| DeBERTa-L (435 M) | 66.9 | 39.7 | 64.5 | 46.6 | 60.5 | 69.3 |
| BART-L (407 M) | 56.1 | 62.8 | 58.7 | 52.0 | 56.8 | 57.7 |
| GPT-J | 48.7 | 59.0 | 73.6 | 50.0 | 56.8 | 66.5 |
| T5 (11 B) | 40.5 | 59.0 | 48.8 | 49.7 | 56.8 | 68.6 |
| T0 (11 B) | 36.5 | 60.3 | 72.7 | 49.7 | 56.8 | 77.2 |
| NEOX-20B | 52.7 | 56.4 | 59.5 | 54.0 | 48.1 | 70.0 |
| OPT (66 B) | 47.6 | 53.9 | 60.3 | 52.7 | 58.0 | 58.3 |
| BLOOM (176 B) | 48.7 | 59.0 | 73.6 | 49.7 | 56.8 | 66.5 |
| text-davinci-002 (175 B) | 46.0 | 28.2 | 54.6 | 45.3 | 35.8 | 68.8 |
| text-davinci-003 (175 B) | 44.6 | 55.1 | 44.6 | 38.5 | 34.6 | 62.9 |
| ChatGPT (175 B) | 39.9 | 18.0 | 32.2 | 34.5 | 24.7 | 55.3 |
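As a rough illustration of how such a number can be computed, the sketch below assumes ASR is the percentage of adversarial examples that the model misclassifies. That reading is consistent with the Random row (50.0 on the binary tasks, 66.7 on the three-class ones), but the exact definition should be checked against the paper. The function name is hypothetical.

```python
def attack_success_rate(predictions, gold_labels):
    """Percentage of adversarial examples whose prediction differs from the gold label.

    Assumption: ASR is the error rate on the adversarial set, in percent.
    """
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute ASR on an empty set")
    wrong = sum(p != g for p, g in zip(predictions, gold_labels))
    return 100.0 * wrong / len(gold_labels)
```

Under this sketch, an output that could not be mapped to any label (for example `None`) counts as a successful attack.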

Adversarial robustness for machine translation

The metrics are BLEU, GLEU, and METEOR; higher is better.

| Translation | BLEU | GLEU | METEOR |
|---|---|---|---|
| Helsinki-NLP/opus-mt-en-zh | 18.11 | 26.78 | 46.38 |
| liam168/trans-opus-mt-en-zh | 15.23 | 24.89 | 45.02 |
| text-davinci-002 | 24.97 | 36.3 | 59.28 |
| text-davinci-003 | 30.6 | 40.01 | 61.88 |
| ChatGPT | 26.27 | 37.29 | 58.95 |
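To give a feel for what these overlap metrics measure, here is a small standard-library sketch of a sentence-level GLEU score. It assumes GLEU refers to the Google-BLEU variant (the minimum of n-gram precision and recall over 1- to 4-grams) and a single reference. It is not the scoring code behind the table, which may use a different implementation, tokenization, or corpus-level aggregation.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all n-grams of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(hypothesis, reference, max_n=4):
    """Return min(n-gram precision, n-gram recall) for one hypothesis/reference pair."""
    hyp = ngram_counts(hypothesis, max_n)
    ref = ngram_counts(reference, max_n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    hyp_total = sum(hyp.values())
    ref_total = sum(ref.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    return min(overlap / hyp_total, overlap / ref_total)
```

Both arguments are token lists. For Chinese targets, one simple (assumed) choice is character-level tokens via `list(sentence)`. The result lies in [0, 1]; multiply by 100 to compare with the scale used in the table.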

Out-of-distribution robustness

The metric is F1 score (in %); higher is better.

| Model | Flipkart | DDXPlus |
|---|---|---|
| Random | 20 | 4 |
| DeBERTa-L (435 M) | 60.6 | 4.5 |
| BART-L (407 M) | 57.8 | 5.3 |
| GPT-J | 28 | 2.4 |
| T5 (11 B) | 58.8 | 6.3 |
| T0 (11 B) | 58.3 | 8.4 |
| NEOX-20B | 39.4 | 12.3 |
| OPT (66 B) | 44.5 | 0.3 |
| BLOOM (176 B) | 28 | 0.1 |
| text-davinci-002 (175 B) | 57.5 | 18.9 |
| text-davinci-003 (175 B) | 57.3 | 19.6 |
| ChatGPT (175 B) | 60.6 | 20.2 |
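This README does not say how the F1 score is averaged across classes. As an illustration only, the sketch below assumes macro-averaging (the unweighted mean of per-class F1) and reports the result in percent; the function name is hypothetical, and the averaging scheme used for the table may differ.

```python
def macro_f1(predictions, gold_labels):
    """Macro-averaged F1 in percent over every class seen in predictions or gold labels.

    Assumption: macro-averaging; the source does not state the averaging scheme.
    """
    if len(predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predictions, gold_labels))
    classes = set(gold_labels) | set(predictions)
    scores = []
    for c in classes:
        tp = sum(p == c and g == c for p, g in pairs)
        fp = sum(p == c and g != c for p, g in pairs)
        fn = sum(p != c and g == c for p, g in pairs)
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2*TP / (2*TP + FP + FN).
        scores.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(scores)
```

Because the class set includes predicted values, an unparseable output such as `None` becomes its own class with an F1 of 0 and pulls the average down.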

Citation

@article{wang2023robustness,
  title={On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective},
  author={Wang, Jindong and Hu, Xixu and Hou, Wenxin and Chen, Hao and Zheng, Runkai and Wang, Yidong and Yang, Linyi and Huang, Haojun and Ye, Wei and Geng, Xiubo and Jiao, Binxin and Zhang, Yue and Xie, Xing},
  journal={arXiv preprint arXiv:2302.12095},
  year={2023}
}

Disclaimer

Note that the results of some generative models may change from run to run due to the nature of generation, so the results produced by this repo may change as well. Additionally, the outputs of generative foundation models are not very clean, so you will need to manually process some noisy outputs to get what you want. We suggest using this code mainly for demos and practice, since it makes it easy to collect the outputs of multiple models. But remember, human evaluation and processing are also important.