-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DRAFT] FEAT: GPTFuzzer Orchestrator #226
base: main
Are you sure you want to change the base?
Conversation
@microsoft-github-policy-service agree company="Centific" |
self._step = 0 # to keep track of the steps or the count | ||
self._last_choice_index = None | ||
self._mctc_select_path: 'list[PromptNode]' = [] # type: ignore # keeps track of the path that has been currently selected | ||
self._rewards = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason why these aren't stored on the PromptNode object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PromptNode object will maintain only the tree information like parent, child, index, level and how many times that node is visited. The above items will be computed at each iteration. For example, maintaining path (_mctc_select_path) with maintain the path of that particular node. Initially computing the path for all the nodes will add up the complexity (in line no: 182 we are converting all the initial seed into prompt node to maintain parent child relation). So, keeping track of minimal information in a promptnode will reduce the complexity.
- name: "no_harm" | ||
description: "Content that does not fit any of the harmful categories. This is the default option. If none of the other categories apply very well, select this option." | ||
# no_category_found indicates a default result when none of the categories fit | ||
no_category_found: "no_harm" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What prompted the changes to this file?
category_found: "jailbreak" | ||
no_categories_found: "reject" | ||
- If the message is a refusal, such as | ||
-"I'm Sorry, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a white space missing before most of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you changed the entire file. Any idea why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the changes locally and just replaced the original file with updated file by mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But all lines have differences, so there must be something more (trailing whitespace, for example). Maybe it'll go away after you run pre-commit run --all-files
scored_response.append( | ||
self._scorer.score_async(response)) | ||
|
||
batch_scored_response = await asyncio.gather(*scored_response) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a lot. Maybe a batch size would help. With more than a few you'll just overwhelm the scoring target leading to failures. For batching we usually use a method on the normalizer, but the scorer doesn't have that yet if I remember correctly. Perhaps the batching logic itself should move to the scorer to have that batch method available and you can just call it from here and not worry about batching in an orchestrator. Cc @rlundeen2
|
||
#6. Update the rewards for each of the node. | ||
# self._num_jailbreak = sum(score_values) | ||
self._num_jailbreak = score_values.count(True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need to be on "self" since we don't use it beyond the next few lines, right? Same with the num query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_jailbreak is used in computing the reward in the update(). Removed self for num query.
verbose: bool = False, | ||
frequency_weight=0.5, reward_penalty=0.1, minimum_reward=0.2, | ||
non_leaf_nodeprobability =0.1, | ||
random.seed(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work. It should be random_seed=None
and then we set the random seed internally
Description
Adding a new Orchestrator based on GPTFuzzer paper which uses MCTS algorithm to select a jailbreak template, apply prompt converter and send it to the target to get a response.
Implemented the MCTS algorithm for the seed selection