
FEAT: Crescendo part 1 (scorers) #203

Closed · wants to merge 16 commits
Conversation

cseifert1 (Contributor)

Description

Implements the first part of the Crescendo attack strategy. So far this adds the relevant scorers and prompt templates.

Tests and Documentation

Added unit tests and scoring notebook sections to test out the scorers.

@@ -2,7 +2,7 @@
"cells": [
{
Contributor:

can remove this file for now

@@ -113,3 +115,10 @@ def _validate(self, scorer_type, score_value):
                raise ValueError(f"Float scale scorers must have a score value between 0 and 1. Got {score_value}")
            except ValueError:
                raise ValueError(f"Float scale scorers require a numeric score value. Got {score_value}")
        elif scorer_type == "severity":
            try:
Contributor:

I think we can use float_scale for this; there is a method to normalize any value to one between 0.0 and 1.0

IMO we should delete this

Contributor (Author):

I don't think we should. The documented severities are specific int values. While we could represent them as floats (e.g. 2 -> 0.02), I think that would be confusing for a caller.
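For context, the normalization behind the float_scale suggestion is a simple linear rescale; a minimal sketch in the spirit of the scale_value_float call quoted later in this review (the helper body and the 0-7 severity range are assumptions, not confirmed by this PR):

    def scale_value_float(value: float, min_value: float, max_value: float) -> float:
        """Linearly rescale value from [min_value, max_value] into [0.0, 1.0]."""
        if max_value == min_value:
            return 0.0
        return (value - min_value) / (max_value - min_value)


    # Assuming Azure Content Safety's "EightSeverityLevels" severities run 0-7,
    # a raw severity of 2 would land at 2 / 7 ≈ 0.29 on the float scale.
    normalized = scale_value_float(2, 0, 7)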

# These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms.
# However, a recent line of attacks, known as "jailbreaks", seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow
# the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo.
# Unlike existing jailbreak methods, Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner.
@rlundeen2 (Contributor) commented on May 15, 2024:

This looks like a great start!

But can we remove all non-scoring pieces from this PR? E.g. remove this, the orchestrator, etc. (you can keep the code for future PRs; just branch off of your current branch and remove them from the current PR).

description: |
  A variant of the crescendo attack technique
harm_category: NA
author: Ahmed Salem
Contributor:

I'd remove these for now

@@ -264,6 +264,13 @@ def __str__(self):
        return self.strategy.apply_custom_metaprompt_parameters(**self.kwargs)


class ToolCall(BaseModel):
Contributor:

it's not clear to me what this is


        self._harm_category = harm_category

        if azure_content_safety_key is not None and azure_content_safety_endpoint is not None:
Contributor:

Can we add env keys for this and get the default values? See most of our targets for examples of how to do this. They should also be added to our .env_example :)

Contributor:

e.g.

    self._api_key = default_values.get_required_value(
        env_var_name=self.API_KEY_ENVIRONMENT_VARIABLE, passed_value=api_key
    )
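A slightly fuller sketch of that pattern, assuming hypothetical environment variable names (the real ones would also go in .env_example) and the pyrit.common.default_values helper used above:

    from pyrit.common import default_values
    from pyrit.score import Scorer


    class AzureContentFilterScorer(Scorer):
        # Hypothetical names for illustration; pick whatever matches .env_example.
        API_KEY_ENVIRONMENT_VARIABLE = "AZURE_CONTENT_SAFETY_API_KEY"
        ENDPOINT_URI_ENVIRONMENT_VARIABLE = "AZURE_CONTENT_SAFETY_API_ENDPOINT"

        def __init__(self, *, api_key: str = None, endpoint: str = None) -> None:
            # Fall back to the environment when no explicit value is passed.
            self._api_key = default_values.get_required_value(
                env_var_name=self.API_KEY_ENVIRONMENT_VARIABLE, passed_value=api_key
            )
            self._endpoint = default_values.get_required_value(
                env_var_name=self.ENDPOINT_URI_ENVIRONMENT_VARIABLE, passed_value=endpoint
            )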


        request = AnalyzeTextOptions(text=request_response.converted_value, output_type="EightSeverityLevels")
        # Analyze the text and get the results for the specified category

Contributor:

we should use AnalyzeTextAsync and await the method
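In the Python SDK the async variant lives under azure.ai.contentsafety.aio rather than an AnalyzeTextAsync method; a sketch of what awaiting it could look like (the wrapper function is illustrative):

    from azure.ai.contentsafety.aio import ContentSafetyClient
    from azure.ai.contentsafety.models import AnalyzeTextOptions
    from azure.core.credentials import AzureKeyCredential


    async def analyze_text_async(endpoint: str, key: str, text: str):
        # The aio client exposes the same analyze_text call, but awaitable.
        async with ContentSafetyClient(endpoint, AzureKeyCredential(key)) as client:
            request = AnalyzeTextOptions(text=text, output_type="EightSeverityLevels")
            return await client.analyze_text(request)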

"""
self.validate(request_response)

request = AnalyzeTextOptions(text=request_response.converted_value, output_type="EightSeverityLevels")
Contributor:

It would be nice to do images here too, and just AnalyzeImage depending on the request_response data type

Contributor:

(not a blocker, but nice to have)
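A sketch of the suggested branching, assuming the piece's converted_value_data_type field distinguishes text from an image path (the function name is hypothetical):

    from azure.ai.contentsafety.models import AnalyzeImageOptions, AnalyzeTextOptions, ImageData


    def build_analyze_request(request_response):
        # Branch on the request piece's data type, as suggested above.
        if request_response.converted_value_data_type == "image_path":
            with open(request_response.converted_value, "rb") as image_file:
                return AnalyzeImageOptions(image=ImageData(content=image_file.read()))
        return AnalyzeTextOptions(
            text=request_response.converted_value, output_type="EightSeverityLevels"
        )

The scorer could then call analyze_image or analyze_text on the client accordingly.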

        # Analyze the text and get the results for the specified category
        response = self._azureCFClient.analyze_text(request)
        result = next((item for item in response.categories_analysis if item.category == self._harm_category), None)
Contributor:

Set score_type to "float_scale".

And for score_value, call this (I think they range from 1 to 8?):

    score_value = self.scale_value_float(float(parsed_response["score_value"]), 1, 8)

        score = Score(
            score_type="severity",
            score_value=result.severity,
            score_value_description="severity",
Contributor:

description can be "none", or, to be consistent, it could be whatever Azure defines as the specific description for each of the levels.

from pyrit.models import PromptRequestPiece


class AzureContentFilter(Scorer):
Contributor:

I'd rename to AzureContentFilterScorer (and rename this class)

@rlundeen2 (Contributor) left a comment:

For this PR, you may want to just have "azure_content_filter_scorer". That is a concrete piece of work that is close to done and would immediately provide value. Then you could title this PR "Azure Content Filter Scorer".

In a second PR I would include the other three scorers: eval_prompt, meta_judge_prompt, and refuse_judge_prompt.

I really like these three prompts. Broadly, I think they should be broken up into two separate scorers:

  • SelfAskVerifyScore, which contains the meta_judge prompt and is a TrueFalse scorer
  • SelfAskConversationObjectiveScorer, which I would also probably make a TrueFalse scorer, putting the value from 0 to 1 in metadata. I like that result_percentage; we may add something like that to all TrueFalse scorers, but for now metadata will allow you to do operations on it.

IMO this would allow us to reuse them, and each of these will be useful in other scoring situations.

There are some nits too, like the prompts should go in datasets.
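A rough skeleton of the proposed split; only the class names come from the review, everything else is an assumption:

    from pyrit.score import Scorer


    class SelfAskVerifyScore(Scorer):
        """TrueFalse scorer wrapping the meta_judge prompt, per the review above."""


    class SelfAskConversationObjectiveScorer(Scorer):
        """TrueFalse scorer for objective completion; the 0-to-1 result_percentage
        would travel in the score's metadata so callers can still operate on it."""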

@romanlutz (Contributor):

Closing in favor of #275

@romanlutz closed this on Jul 12, 2024