Guardrails Evaluation

We propose a set of tools that can be used to evaluate the different types of rails implemented in NeMo Guardrails. In the current version, these tools are designed to test the performance of each type of rail individually.

The evaluation tools can be run directly from the CLI; examples are provided for each type of rail.

We also provide preliminary results on the performance of the rails on public datasets relevant to each task.

Topical Rails

Aim and Usage

Topical rails evaluation focuses on the core mechanism used by NeMo Guardrails to guide conversations using canonical forms and dialogue flows. More details about this core functionality are available here.

Thus, when using the topical rails evaluation, we are actually assessing the performance of three stages (illustrated by the example after the list below):

  1. User canonical form (intent) generation
  2. Next step generation: in the current approach, we only assess the bot canonical form generated as the next step in a flow
  3. Bot message generation
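
As a rough illustration, the record below shows what these three stages might look like for a single test turn. The field names and values are hypothetical; they are not the evaluation tool's actual output format.

```python
# Hypothetical example of one evaluated turn; not the tool's output schema.
example_turn = {
    "user_message": "Can you recommend a good book?",
    # 1. user canonical form (intent) generated from the user message
    "generated_user_intent": "ask for book recommendation",
    # 2. next step: the bot canonical form predicted to follow in the flow
    "generated_bot_intent": "respond with book recommendation",
    # 3. bot message generated from the bot canonical form
    "generated_bot_message": "You might enjoy 'The Pragmatic Programmer'.",
}
# Accuracy is reported separately for each of the three stages, by comparing
# the generated value against the expected one from the test set.
```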

To use the topical rails evaluation tool, the CLI command is:

nemoguardrails evaluate topical --config=<rails_app_path> --verbose

A topical rails evaluation has the following CLI parameters:

  • config: The Guardrails app to be evaluated.
  • verbose: Whether the Guardrails app should be run in verbose mode.
  • test-percentage: Percentage of the samples for each intent to be used as the test set.
  • max-tests-intent: Maximum number of test samples per intent to use when testing (useful for having a balanced test set for unbalanced datasets). If the value is 0, this parameter is not used.
  • max-samples-intent: Maximum number of samples per intent to be indexed in the vector database. If the value is 0, all samples not in the test set are used.
  • results-frequency: If set, intermediate evaluation results are printed at this interval (number of test samples).
  • sim-threshold: If larger than 0, generated intents that do not have an exact match are mapped to the most similar known intent with a similarity above this threshold.
  • random-seed: Random seed used by the evaluation.
  • output-dir: Output directory for the predictions.

Evaluation Results

For the initial topical rails evaluation experiments, we used two conversational NLU datasets: a chit-chat dataset and a banking dataset.

The datasets were transformed into a NeMo Guardrails app by defining canonical forms for each intent, specific dialogue flows, and (for the chit-chat dataset only) bot messages. Both datasets have a large number of user intents, and thus of topical rails. The chit-chat dataset is generic, with coarser-grained intents, while the banking dataset is domain-specific and more fine-grained. More details about running the topical rails evaluation experiments and about the evaluation datasets are available here.

Preliminary evaluation results follow. In all experiments, we use a balanced test set with at most 3 samples per intent. For both datasets, we assess performance for various LLMs and for different numbers of samples per intent (k = all, 3, 1) indexed in the vector database.

Keep in mind that the performance of an LLM depends heavily on the prompt, especially given the more complex prompts used by Guardrails. Therefore, we currently release results for a limited set of models, with more to follow in upcoming releases. All results are preliminary, as better prompting can improve them.

Important lessons from the evaluation results:

  • Each step in the three-step approach (user intent, next step / bot intent, bot message) used by Guardrails offers an improvement in performance.
  • It is important to have at least k=3 samples in the vector database for each user intent (canonical form) to achieve good performance.
  • Some models (e.g., gpt-3.5-turbo) produce a wider variety of canonical forms, even with the few-shot prompting used by Guardrails. In these cases, it is useful to use a similarity match instead of an exact match for user intents, and the similarity threshold becomes an important inference parameter (a sketch of this fallback follows the list).
  • Initial results show that even small models (e.g., dolly-v2-3b, vicuna-7b-v1.3, mpt-7b-instruct, falcon-7b-instruct) achieve good performance for topical rails.
  • Using a single call for topical rails shows results similar to the default method (which uses up to 3 LLM calls to generate the final bot message) in most cases for the text-davinci-003 model.
  • Initial experiments show that compact prompts achieve similar or even better performance on these two datasets compared to the longer prompts.
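
The sketch below shows one way such a similarity fallback could work. It is a minimal illustration, not the evaluation tool's implementation: `embed` is a placeholder for any sentence-embedding model, and cosine similarity is compared against the sim-threshold value.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for any sentence-embedding model (not part of the eval tool)."""
    raise NotImplementedError

def match_intent(generated: str, known_intents: list[str], threshold: float = 0.6) -> str | None:
    """Map a generated canonical form to a known intent, with a similarity fallback."""
    if generated in known_intents:
        return generated  # exact match
    gen_vec = embed(generated)
    sims = []
    for intent in known_intents:
        vec = embed(intent)
        sims.append(float(np.dot(gen_vec, vec) / (np.linalg.norm(gen_vec) * np.linalg.norm(vec))))
    best = int(np.argmax(sims))
    # accept the most similar known intent only if it clears the threshold
    return known_intents[best] if sims[best] >= threshold else None
```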

Evaluation Date - June 21, 2023. Updated July 24, 2023 for Dolly, Vicuna and Mosaic MPT models.

| Dataset | # intents | # test samples |
|---|---|---|
| chit-chat | 76 | 226 |
| banking | 77 | 231 |

Results on the chit-chat dataset; the metric used is accuracy.

| Model | User intent, w/o sim | User intent, sim=0.6 | Bot intent, w/o sim | Bot intent, sim=0.6 | Bot message, w/o sim | Bot message, sim=0.6 |
|---|---|---|---|---|---|---|
| text-davinci-003, k=all | 0.89 | 0.89 | 0.90 | 0.90 | 0.91 | 0.91 |
| text-davinci-003, k=all, single call | 0.89 | N/A | 0.91 | N/A | 0.91 | N/A |
| text-davinci-003, k=all, compact | 0.90 | N/A | 0.91 | N/A | 0.91 | N/A |
| text-davinci-003, k=3 | 0.82 | N/A | 0.85 | N/A | N/A | N/A |
| text-davinci-003, k=1 | 0.65 | N/A | 0.73 | N/A | N/A | N/A |
| gpt-3.5-turbo-instruct, k=all | 0.88 | N/A | 0.88 | N/A | 0.88 | N/A |
| gpt-3.5-turbo-instruct, single call | 0.90 | N/A | 0.91 | N/A | 0.91 | N/A |
| gpt-3.5-turbo-instruct, compact | 0.89 | N/A | 0.89 | N/A | 0.90 | N/A |
| gpt-3.5-turbo, k=all | 0.44 | 0.56 | 0.50 | 0.61 | 0.54 | 0.65 |
| llama2-13b-chat, k=all | 0.87 | N/A | 0.88 | N/A | 0.89 | N/A |
| dolly-v2-3b, k=all | 0.80 | 0.82 | 0.81 | 0.83 | 0.81 | 0.83 |
| vicuna-7b-v1.3, k=all | 0.62 | 0.75 | 0.69 | 0.77 | 0.71 | 0.79 |
| mpt-7b-instruct, k=all | 0.73 | 0.81 | 0.78 | 0.82 | 0.80 | 0.82 |
| falcon-7b-instruct, k=all | 0.81 | 0.81 | 0.81 | 0.82 | 0.82 | 0.82 |

Results on the banking dataset; the metric used is accuracy.

| Model | User intent, w/o sim | User intent, sim=0.6 | Bot intent, w/o sim | Bot intent, sim=0.6 | Bot message, w/o sim | Bot message, sim=0.6 |
|---|---|---|---|---|---|---|
| text-davinci-003, k=all | 0.77 | 0.82 | 0.83 | 0.84 | N/A | N/A |
| text-davinci-003, k=all, single call | 0.75 | N/A | 0.81 | N/A | N/A | N/A |
| text-davinci-003, k=all, compact | 0.86 | N/A | 0.86 | N/A | N/A | N/A |
| text-davinci-003, k=3 | 0.65 | N/A | 0.73 | N/A | N/A | N/A |
| text-davinci-003, k=1 | 0.50 | N/A | 0.63 | N/A | N/A | N/A |
| gpt-3.5-turbo-instruct, k=all | 0.73 | N/A | 0.74 | N/A | N/A | N/A |
| gpt-3.5-turbo-instruct, single call | 0.81 | N/A | 0.83 | N/A | N/A | N/A |
| gpt-3.5-turbo-instruct, compact | 0.86 | N/A | 0.87 | N/A | N/A | N/A |
| gpt-3.5-turbo, k=all | 0.38 | 0.73 | 0.45 | 0.73 | N/A | N/A |
| llama2-13b-chat, k=all | 0.76 | N/A | 0.77 | N/A | N/A | N/A |
| dolly-v2-3b, k=all | 0.32 | 0.62 | 0.40 | 0.64 | N/A | N/A |
| vicuna-7b-v1.3, k=all | 0.39 | 0.62 | 0.54 | 0.65 | N/A | N/A |
| mpt-7b-instruct, k=all | 0.45 | 0.58 | 0.50 | 0.60 | N/A | N/A |
| falcon-7b-instruct, k=all | 0.70 | 0.75 | 0.76 | 0.78 | N/A | N/A |

Execution Rails

Fact-checking Rails

The default fact checking rail is implemented as an entailment prediction problem: given a piece of evidence and the predicted answer, an LLM call is made to predict whether or not the answer is grounded in the evidence.
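
In outline, the check boils down to a single LLM call along the following lines. This is a simplified sketch, not the exact prompt used by the rail, and `call_llm` is a placeholder for whatever LLM client is configured.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM completion call; not part of NeMo Guardrails' API."""
    raise NotImplementedError

def is_grounded(evidence: str, answer: str) -> bool:
    """Ask the LLM whether `answer` is entailed by `evidence`."""
    prompt = (
        "You are given a piece of evidence and an answer.\n"
        f"Evidence: {evidence}\n"
        f"Answer: {answer}\n"
        "Is the answer supported by the evidence? Answer 'yes' or 'no'."
    )
    return call_llm(prompt).strip().lower().startswith("yes")
```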

To run the fact checking rail, you can use the following CLI command:

nemoguardrails evaluate fact-checking

Here is a list of arguments that you can use to configure the fact checking rail:

  • dataset-path: Path to the dataset. It should be a JSON file with the following format:

    [
        {
            "question": "question text",
            "answer": "answer text",
            "evidence": "evidence text"
        }
    ]

  • llm: The LLM provider to use. Default is openai.

  • model-name: The name of the model to use. Default is text-davinci-003.

  • num-samples: Number of samples to run the eval on. Default is 50.

  • create-negatives: Whether to generate synthetic negative examples or not. Default is True.

  • output-dir: The directory to save the output to. Default is eval_outputs/factchecking.

  • write-outputs: Whether to write the outputs to a file or not. Default is True.

More details on how to set up the data in the right format and run the evaluation on your own dataset can be found here.
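
To get a feel for the expected data, a file in the format above could be loaded and checked roughly as follows. This reuses the hypothetical `is_grounded` helper from the sketch above; the file name is only an example, not a default.

```python
import json

# Example file name; point --dataset-path at your own file when using the CLI tool.
with open("factcheck_dataset.json") as f:
    samples = json.load(f)

# Fraction of (question, answer, evidence) triples whose answer is judged
# to be grounded in the evidence (uses the hypothetical `is_grounded` above).
grounded = [is_grounded(s["evidence"], s["answer"]) for s in samples]
print(f"Judged grounded: {sum(grounded)}/{len(grounded)}")
```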

Evaluation Results

We evaluate the performance of the fact checking rail on the MSMARCO dataset. We randomly sample 100 (question, answer, evidence) triples and run the evaluation using the text-davinci-003, gpt-3.5-turbo, and nemollm-43b models.

Evaluation Date - June 02, 2023.

We break down the performance into positive entailment accuracy and negative entailment accuracy. Positive entailment accuracy is the accuracy of the model in correctly identifying answers that are grounded in the evidence passage. Negative entailment accuracy is the accuracy of the model in correctly identifying answers that are not grounded in the evidence. Details on how to create synthetic negative examples can be found here.

| Model | Positive Entailment Accuracy | Negative Entailment Accuracy | Overall Accuracy |
|---|---|---|---|
| text-davinci-003 | 0.83 | 0.87 | 0.85 |
| gpt-3.5-turbo | 0.87 | 0.80 | 0.83 |
| nemollm-43b | 0.80 | 0.83 | 0.81 |

Moderation Rails

The moderation rails involve two components - the jailbreak detection rail and the output moderation rail.

  • The jailbreak detection rail attempts to flag user inputs that could potentially cause the model to output unsafe content.
  • The output moderation rail attempts to filter the language model output to avoid unsafe content from being displayed to the user.

The jailbreak and output moderation rails can be evaluated using the following CLI command:

nemoguardrails evaluate moderation

The arguments that can be passed to evaluate the moderation rails are:

  • model-name: Name of the model to use. Default is 'text-davinci-003'.
  • llm: Name of the LLM provider. Default is 'openai'.
  • dataset-path: Path to the dataset to evaluate the rails on. The dataset should contain one prompt per line.
  • split: The split of the dataset. This can be either 'helpful' or 'harmful' and is used to determine the appropriate label for the predictions.
  • num-samples: Number of samples to evaluate. Default is 50.
  • check-jailbreak: Whether to evaluate the jailbreak rail. Default is True.
  • check-output-moderation: Whether to evaluate the output moderation rail. Default is True.
  • output-path: Folder to write the results to. Default is 'eval_outputs/moderation'.

It is also possible to evaluate each of the rails individually. To evaluate the jailbreak rail only, use the following command:

nemoguardrails evaluate moderation --check-output-moderation False

and to evaluate the output moderation rail only, use the following command:

nemoguardrails evaluate moderation --check-jailbreak False

More details on how to set up the data in the right format and run the evaluation on your own dataset can be found here.

Evaluation Results

We evaluate the moderation rails on the Anthropic Red Team Attempts dataset and the Anthropic Helpful Base dataset. These datasets contain prompts labeled by humans as harmful (Red Team Attempts) or helpful (Helpful Base). We randomly sample 100 prompts from each of the splits and run the evaluation using the text-davinci-003, gpt-3.5-turbo, and nemollm-43b models.

Evaluation Date - June 02, 2023.

We break down the performance of the models on the two rails into the following metrics:

  • % of the prompts that are blocked on the Red Team Attempts dataset
  • % of the prompts that are allowed on the Helpful Base dataset

We want the models to block as many harmful prompts as possible and allow as many helpful prompts as possible.

Moderation Rails Performance

| Model | % of harmful prompts blocked | % of helpful prompts allowed |
|---|---|---|
| text-davinci-003 | 80 | 97 |
| gpt-3.5-turbo | 70 | 100 |
| nemollm-43b | 88 | 84 |

Hallucination Rails

For general questions that the model answers using its parametric knowledge, we can define a hallucination rail to detect when the model is potentially making up facts. The default implementation of the hallucination rail is based on SelfCheckGPT.

  • Given a question, we sample multiple answers from the model, typically at a high temperature (temp=1.0).
  • We then check whether the answers are consistent with each other. This agreement check is implemented with an LLM call, similar to the one used by the fact checking rail (a minimal sketch follows this list).
  • If the answers are inconsistent with each other, it indicates that the model might be hallucinating.
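
The sketch below illustrates this consistency check under simplifying assumptions: `call_llm` is a placeholder for the configured LLM (not a NeMo Guardrails API), the agreement check is reduced to a yes/no LLM judgment per sampled answer, and the 0.5 threshold is illustrative rather than the rail's default.

```python
def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for an LLM completion call; not part of NeMo Guardrails' API."""
    raise NotImplementedError

def agrees(reference: str, other: str) -> bool:
    """Ask the LLM whether two answers are consistent with each other."""
    prompt = (
        f"Answer 1: {reference}\nAnswer 2: {other}\n"
        "Do these two answers agree with each other? Answer 'yes' or 'no'."
    )
    return call_llm(prompt).strip().lower().startswith("yes")

def looks_like_hallucination(question: str, bot_answer: str, n_samples: int = 3) -> bool:
    """Sample extra answers at high temperature and flag low agreement."""
    samples = [call_llm(question, temperature=1.0) for _ in range(n_samples)]
    agreement = sum(agrees(bot_answer, s) for s in samples) / n_samples
    return agreement < 0.5  # illustrative threshold
```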

To run the hallucination rail, use the following CLI command:

nemoguardrails evaluate hallucination

Here is a list of arguments that you can use to configure the hallucination rail:

  • dataset-path: Path to the dataset. It should be a text file with one question per line.
  • llm: The LLM provider to use. Default is openai.
  • model-name: The name of the model to use. Default is text-davinci-003.
  • num-samples: Number of samples to run the eval on. Default is 50.
  • output-dir: The directory to save the output to. Default is eval_outputs/hallucination.
  • write-outputs: Whether to write the outputs to a file or not. Default is True.

To evaluate the hallucination rail on your own dataset, create a text file with one question per line and run the evaluation using the following command:

nemoguardrails evaluate hallucination --dataset-path <path-to-your-text-file>

Evaluation Results

To evaluate the hallucination rail, we manually curate a set of questions which mainly consists of questions with a false premise, i.e., questions that cannot have a correct answer.

For example, the question "What is the capital of the moon?" has a false premise since the moon does not have a capital. Since the question is stated in a way that implies that the moon has a capital, the model might be tempted to make up a fact and answer the question.

We then run the hallucination rail on these questions and check if the model is able to detect the hallucination. We run the evaluation using OpenAI text-davinci-003 and gpt-3.5-turbo models.

Evaluation Date - June 12, 2023.

We break down the performance into the following metrics:

  • % of questions intercepted by the model alone, i.e., questions that the model itself identifies as not answerable
  • % of questions intercepted by the model + hallucination rail, i.e., questions where either the model identifies the question as not answerable or the hallucination rail detects that the model is making up facts

| Model | % intercepted - model | % intercepted - model + hallucination rail |
|---|---|---|
| text-davinci-003 | 0 | 70 |
| gpt-3.5-turbo | 65 | 90 |

We find that gpt-3.5-turbo is able to intercept 65% of the questions and identify them as not answerable on its own. Adding the hallucination rail helps intercept 25% more questions and prevents the model from making up facts.