Can't Do That Anymore Eval (openai#1487)
@JunShern will review this

# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines; **failure to follow
the guidelines below will result in the PR being closed automatically**.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell
whether the eval fails or not. Please run your eval with GPT-3.5-Turbo,
but keep in mind that when we run the eval, if GPT-4 scores higher than
90%, we will likely reject it, since GPT-4 is already capable of
completing the task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples; we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑

### Eval name

Can't do that anymore

### Eval description

Evaluates how well models can adapt their predictions to new rules, even
when they have strong biases.

### What makes this a useful eval?

[Insert why this eval is worth including and any additional context]

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo cannot.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data available under the same MIT license as this repository. You
must have adequate rights to upload any data used in an Eval. OpenAI
reserves the right to use this data in future service improvements to
our product. Contributions to OpenAI Evals will be subject to our usual
Usage Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions, and thus cannot grant GPT-4
access to everyone who opens a PR. We know this is disappointing, but we
hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `mypy`, `black`,
`isort`, `autoflake` and `ruff` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to include at
least 5 eval samples from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
  INSERT_EVAL_HERE
  ```
</details>
ojaffe committed Mar 19, 2024
1 parent ad377e4 commit f72afb9
Showing 23 changed files with 2,352 additions and 0 deletions.
69 changes: 69 additions & 0 deletions evals/elsuite/cant_do_that_anymore/README.md
@@ -0,0 +1,69 @@
# Can't Do That Anymore

We evaluate how well models can adapt to new rules of an environment by applying novel reasoning to a task, rather than following biases seen during training. We task models with playing a variant of chess in which bishops move as knights do, and evaluate whether they can avoid making moves that are ordinarily legal but illegal under the variant's rules.

## Usage

Run with:

```
oaieval <solver> cant_do_that_anymore
```

We suggest using `generation/direct/gpt-3.5-turbo` or `generation/direct/gpt-4-turbo-preview` as default choices for `<solver>`.

For more examples of running this eval, see `scripts/run_experiments.sh`.

## Dataset

For each model we evaluate, we construct a dataset where every sample contains a board position and the next move that was played, which is legal for the board position under the normal rules of chess but illegal under the rules of our variant (i.e. the next move is a bishop moving diagonally). We call these types of moves *special moves*. We additionally filter to only include special moves that the model would have predicted under temperature=0 with the normal rules. This lets us evaluate whether models change their predictions when given the variant rules, despite strongly predicting the special move under the normal rules.
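
The dataset construction uses the eval's own chess implementation (`chess/board.py`, included below). Purely as an illustration of the special-move check, here is a sketch using the third-party `python-chess` package, which the eval itself does not depend on:

```python
import chess  # third-party package: pip install chess


def is_special_move(board: chess.Board, move: chess.Move) -> bool:
    """A "special move" is a bishop move that is legal under normal rules.

    Bishops only ever move diagonally in standard chess, so any legal
    bishop move is illegal in the variant where bishops move as knights.
    """
    piece = board.piece_at(move.from_square)
    return (
        piece is not None
        and piece.piece_type == chess.BISHOP
        and move in board.legal_moves
    )


board = chess.Board()
board.push_san("e4")
board.push_san("e5")
print(is_special_move(board, chess.Move.from_uci("f1c4")))  # True
```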

Each model's dataset is automatically found and loaded upon running the eval. If a dataset doesn't exist for a particular solver, one will automatically be constructed for it.

## Evaluation Process

Samples from the dataset are evaluated one-by-one. Each sample contains a board position and the special move (next move). We prompt models to predict the next best move given the board position, separately under both the normal rules of chess and our variant's rules. We then measure whether the model predicted the special move from the sample under both rule settings. If the model were perfectly following the given rules, we'd expect it to never predict the special move under the variant's rules.
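
In pseudocode, the per-sample check works roughly as below. The `predict_move` callable is a hypothetical stand-in for querying the solver at temperature=0 with a given prompt style, not the eval's actual interface:

```python
from typing import Callable, Dict


def score_sample(sample: Dict, predict_move: Callable[[str, str], str]) -> Dict[str, bool]:
    # Ask the solver for the next move twice: once prompted with the normal
    # rules and once with the variant rules (see defaults.py for the prompts).
    pred_normal = predict_move("normal_rules", sample["board_state"])
    pred_variant = predict_move("variant_rules", sample["board_state"])
    # Record whether the solver played the sample's special move in each setting.
    return {
        "predicted_under_normal": pred_normal == sample["special_move"],
        "predicted_under_variant": pred_variant == sample["special_move"],
    }
```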

To see how we prompt models under each rule setting, see `defaults.py`.

## Metrics

Below are the key metrics for this eval:

| Metric | Interpretation |
| --- | --- |
| `variant_impact_factor` | The relative decrease in special move predictions under the variant's rules, compared to predictions under the normal rules. Lower is better; a perfect score is -1. |
| `delta` | The absolute decrease in special move predictions under the variant's rules, relative to the model's predictions under the normal rules. Lower is better. |
| `predicted_move_proportion` | The proportion of samples where the model predicted the special move under the normal rules. |
| `predicted_move_in_variant_proportion` | The proportion of samples where the model predicted the special move under the variant's rules. |
| `avg_num_previous_moves` | Average number of previous moves leading up to the board positions, across all samples. |
| `std_num_previous_moves` | Standard deviation of the number of previous moves leading up to the board positions, across all samples. |
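
Under one plausible reading of these definitions, the headline metrics fall out of the per-sample results sketched in the previous section (the dictionary keys are the hypothetical ones used there):

```python
from typing import Dict, List


def compute_metrics(results: List[Dict[str, bool]]) -> Dict[str, float]:
    n = len(results)
    p_normal = sum(r["predicted_under_normal"] for r in results) / n
    p_variant = sum(r["predicted_under_variant"] for r in results) / n
    return {
        "predicted_move_proportion": p_normal,
        "predicted_move_in_variant_proportion": p_variant,
        # Absolute decrease: more negative means the variant rules
        # suppressed the special move more strongly.
        "delta": p_variant - p_normal,
        # Relative decrease: -1.0 means the model never played the special
        # move once given the variant rules.
        "variant_impact_factor": (p_variant - p_normal) / p_normal,
    }
```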

## Variants

| Variant | Notes |
| --- | --- |
| Default: `cant_do_that_anymore.all` | Default setting. Each dataset has 1000 samples. |
| `cant_do_that_anymore.all_small` | A smaller version of the default setting. Each dataset has 100 samples. |
| `cant_do_that_anymore.all_diagonal` | In this variant, we measure the proportion of samples (board positions) where the model will attempt to move a bishop diagonally. |

## Custom Solvers

We use two custom solvers for the base models we evaluate: `chess/generation/direct/gpt-3.5-turbo-instruct` and `chess/generation/direct/gpt-4-base`. These only generate up to four tokens, which prevents the base models from simulating the entire game.
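
The real solvers are registered in the evals framework, but the core idea can be sketched with a raw completions call; here `prompt` is assumed to hold the rendered board position and instructions:

```python
from openai import OpenAI

client = OpenAI()
prompt = "..."  # assumption: the rendered board plus move-prediction instructions

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=4,  # enough for a single move like "e2e4", too few to play on
    temperature=0,
)
move = response.choices[0].text.strip()
```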

## Token Usage Estimates

Below is a rough estimate of the total number of tokens used by the default variant:

| Solver | Input Tokens | Output Tokens | Total Tokens |
| --- | --- | --- | --- |
| generation/direct/gpt-3.5-turbo | 375,000 | 10,000 | 385,000 |
| generation/direct/gpt-4-turbo-preview | 375,000 | 10,000 | 385,000 |

## Version History

- v0: Initial version released

## Contribution statement

Eval design, implementation, and results evaluation was primarily conducted by Oliver Jaffe with contributions from Giulio Starace, under the guidance of (alphabetically by last-name) Steven Adler, James Aung, and Chan Jun Shern who scoped and managed the broader research project, including input on evaluation design, results analysis, and interpretation.
244 changes: 244 additions & 0 deletions evals/elsuite/cant_do_that_anymore/chess/board.py
@@ -0,0 +1,244 @@
import copy
from typing import Callable, Dict, Sequence

from evals.elsuite.cant_do_that_anymore.chess.notation import NotationParser
from evals.elsuite.cant_do_that_anymore.chess.pieces import Piece
from evals.elsuite.cant_do_that_anymore.chess.utils import (
    Move,
    get_other_player_id,
    get_path_between_coords,
    parse_piece,
)


class Board:
    """
    Represents one board position. Is instantiated several times
    by the BoardController to simulate future boards after playing
    some moves.
    """

    def __init__(
        self,
        board_state: Sequence[Sequence[str]],
        piece_id_to_instance: Dict[int, Piece],
        piece_str_to_id: Dict[str, int],
        piece_id_to_str: Dict[int, str],
    ):
        self.board_state = board_state
        self.piece_id_to_instance = piece_id_to_instance
        self.piece_str_to_id = piece_str_to_id
        self.piece_id_to_str = piece_id_to_str

    def __str__(self) -> str:
        str_board = [["" for _ in range(8)] for _ in range(8)]

        for row_idx in range(len(self.board_state)):
            row = self.board_state[row_idx]
            for col_idx in range(len(row)):
                piece_color, piece_id = parse_piece(self.board_state, row_idx, col_idx)

                if piece_color != "E":
                    white_piece = piece_color == "W"
                    s = (
                        self.piece_id_to_instance[piece_id].white_render
                        if white_piece
                        else self.piece_id_to_instance[piece_id].black_render
                    )
                else:
                    s = "\u25A1"
                str_board[row_idx][col_idx] = s

        # Add letters on bottom
        str_board += [["-"] * 8]
        str_board += [["a", "b", "c", "d", "e", "f", "g", "h"]]

        # Add numbers on side
        str_board = [["|"] + row for row in str_board]
        numbers = list(range(8, 0, -1)) + [" ", " "]
        str_board = [[str(numbers[idx])] + row for (idx, row) in enumerate(str_board)]

        # Render as string
        str_board = "\n".join([" ".join(row) for row in str_board])
        return str_board

    def _update_board(self, move: Move):
        """
        Updates board_state according to given move. This move must have previously been checked
        to be legal. Edge cases for moves that:
        1) Take pieces at other positions where this piece isn't moving (en passant)
        2) Move two pieces (castling)
        3) Change the id of the piece (promotion)
        """
        start_coord, target_coord = move.start_coord, move.target_coord
        piece_color, piece_id = parse_piece(self.board_state, start_coord[0], start_coord[1])
        target_piece_color, target_piece_id = parse_piece(
            self.board_state, target_coord[0], target_coord[1]
        )

        # En passant: a pawn moving onto an empty square captures the pawn beside it
        if piece_id == 0 and target_piece_color == "E":
            dy = target_coord[1] - start_coord[1]
            target_en_passant_piece = [start_coord[0], start_coord[1] + dy]
            self.board_state[target_en_passant_piece[0]][target_en_passant_piece[1]] = "E"

        # Castling: move the rook onto the square the king passed over
        if move.castling:
            path = get_path_between_coords(start_coord, target_coord)
            rook_tile = path[0]
            self.board_state[rook_tile[0]][rook_tile[1]] = f"{piece_color}3"

            queenside = target_coord[1] <= 4
            old_rook_tile = [start_coord[0], 0] if queenside else [start_coord[0], 7]
            self.board_state[old_rook_tile[0]][old_rook_tile[1]] = "E"

        # Move piece
        self.board_state[start_coord[0]][start_coord[1]] = "E"
        self.board_state[target_coord[0]][target_coord[1]] = f"{piece_color}{piece_id}"

        # Promotion
        if move.promotion is not None:
            self.board_state[target_coord[0]][target_coord[1]] = f"{piece_color}{move.promotion}"

    def _get_player_moves(self, player_id: str, previous_moves: Sequence[Move]) -> Sequence[Move]:
        """
        Returns all possible moves by pieces for a player. Doesn't filter out moves that
        result in the king being placed under check
        """
        moves = []
        for row_idx in range(len(self.board_state)):
            row = self.board_state[row_idx]
            for col_idx in range(len(row)):
                piece_color, piece_id = parse_piece(self.board_state, row_idx, col_idx)
                if piece_color != player_id:
                    continue

                piece = self.piece_id_to_instance[piece_id]
                possible_piece_moves = piece.get_piece_moves(
                    self.board_state, player_id, [row_idx, col_idx], previous_moves
                )
                moves += possible_piece_moves

        return moves

    def _is_king_in_check(self, player_id: str) -> bool:
        other_player_id = get_other_player_id(player_id)

        other_player_moves = self._get_player_moves(other_player_id, [])
        king_capturing_moves = self._filter_for_king_capturing_moves(other_player_moves, player_id)
        return len(king_capturing_moves) != 0

    def _filter_for_king_capturing_moves(
        self, moves: Sequence[Move], king_color: str
    ) -> Sequence[Move]:
        king_capturing_moves = []
        for move in moves:
            piece_color, piece_id = parse_piece(
                self.board_state, move.target_coord[0], move.target_coord[1]
            )
            if piece_color == king_color and piece_id == 5:
                king_capturing_moves.append(move)

        return king_capturing_moves


class BoardController:
    """
    Manages a single game of chess. Contains logic to find all legal
    moves for a particular player and update the internal board according
    to a given move. Maintains one Board obj to represent the true state of play
    """

    def __init__(
        self,
        board_init: Callable[..., Sequence[Sequence[str]]],
        piece_id_to_instance: Dict[int, Piece],
        piece_str_to_id: Dict[str, int],
        piece_id_to_str: Dict[int, str],
        notation_parser: NotationParser,
    ):
        self.board = Board(board_init(), piece_id_to_instance, piece_str_to_id, piece_id_to_str)
        self.notation_parser = notation_parser

        self.previous_moves = []

    def __str__(self) -> str:
        return self.board.__str__()

    def update_board(self, move: str):
        """
        Parses move, updates the internal board state, then stores the move
        since knowing previous moves is necessary for en passant and castling
        """
        move = self.notation_parser._str_to_move(move, self.board.board_state)
        self.board._update_board(move)
        self.previous_moves.append(move)

    def get_player_legal_moves(self, player_id: str) -> Sequence[str]:
        """
        Gets all legal moves for a player with the given player_id, returned in
        the notation this object was initialised with
        """
        legal_moves = self.board._get_player_moves(player_id, self.previous_moves)
        legal_moves = self._filter_to_prevent_pinning(legal_moves, player_id)

        legal_moves = [
            self.notation_parser._move_to_str(i, self.board.board_state) for i in legal_moves
        ]
        return legal_moves

    def _filter_to_prevent_pinning(self, moves: Sequence[Move], player_id: str) -> Sequence[Move]:
        """
        Filter out moves that would leave the king in check, and castling moves
        where the king would pass over an attacked square
        """

        def _is_valid_castling(move: Move) -> bool:
            if self.board._is_king_in_check(player_id):
                return False

            # Check that the king won't move over an attacked position.
            # dy is the king's direction of travel along the row: +1 or -1
            dy = 1 if move.target_coord[1] > move.start_coord[1] else -1
            king_path = get_path_between_coords(
                move.start_coord, [move.target_coord[0], move.target_coord[1] + dy]
            )

            not_attacked_along_path = []
            for coord in king_path:
                # Simulate the king stepping onto each square of its path
                simulated_board = copy.deepcopy(self.board)
                simulated_board._update_board(
                    Move(move.start_coord, coord, promotion=None, castling=False)
                )
                attacked = simulated_board._is_king_in_check(player_id)
                not_attacked_along_path.append(not attacked)

            return all(not_attacked_along_path)

        filtered_moves = []
        for move in moves:
            if move.castling and _is_valid_castling(move):
                filtered_moves.append(move)
            elif not move.castling:
                simulated_board = copy.deepcopy(self.board)
                simulated_board._update_board(move)
                if not simulated_board._is_king_in_check(player_id):
                    filtered_moves.append(move)

        return filtered_moves

    def _is_checkmate(self, player_id: str) -> bool:
        legal_moves = self.get_player_legal_moves(player_id)
        return len(legal_moves) == 0 and self.board._is_king_in_check(player_id)

    def _is_stalemate(self, player_id: str) -> bool:
        legal_moves = self.get_player_legal_moves(player_id)
        return len(legal_moves) == 0 and not self.board._is_king_in_check(player_id)
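
To make the controller's interface concrete, here is a minimal usage sketch. The factory names, lookup tables, the `"W"`/`"B"` player ids, and the `"e2e4"` move notation are stand-ins inferred from the code above; the real definitions live elsewhere in the `chess` package (e.g. `pieces.py`, `notation.py`, `utils.py`):

```python
# Hypothetical wiring: the names below are illustrative stand-ins, not the
# actual factories exported by the chess package.
controller = BoardController(
    board_init=starting_board_init,             # returns an 8x8 grid of strings like "W3" / "E"
    piece_id_to_instance=PIECE_ID_TO_INSTANCE,  # e.g. {0: Pawn(), ..., 5: King()}
    piece_str_to_id=PIECE_STR_TO_ID,
    piece_id_to_str=PIECE_ID_TO_STR,
    notation_parser=NotationParser(),           # assumed constructor
)

print(controller)                        # ASCII-render the current position
controller.update_board("e2e4")          # play a move in the parser's notation
print(controller.get_player_legal_moves("B"))  # legal replies for the other player
```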