Skip to content
/ OCW Public

Code accompanying the paper "Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset"

License

Notifications You must be signed in to change notification settings

TaatiTeam/OCW

Repository files navigation

RH [NeurIPS 2023] Only Connect Wall (OCW) Dataset

The Only Connect Wall (OCW) dataset contains 618 "Connecting Walls" from the Round 3: Connecting Wall segment of the Only Connect quiz show, collected from 15 seasons' worth of episodes. Each wall contains the ground-truth groups and connections as well as recorded human performance. Please see our paper for more details about the dataset and its motivations.

πŸ“‹ Table of Contents

πŸ“– Usage

Downloading the dataset

The OCW dataset can be downloaded from here or with a bash script:

bash download_OCW.sh

For ablation datasets (OCW-Randomized, OCW-WordNet) download instructions, see here.

Dataset structure

The dataset is provided as JSON files, one for each partition: train.json, validation.json and test.json. We also provide a OCW.json file that contains all examples across all splits. The splits are sized as follows:

Split # Walls
train 62
validation 62
test 494

Here is an example of the dataset's structure:

{
  "season_to_walls_map": {
    "1": {
      "num_walls": 30,
      "start_date": "15/09/2008",
      "end_date": "22/12/2008"
    }
  },
  "dataset": [
    {
      "wall_id": "882c",
      "season": 1,
      "episode": 5,
      "words": [
        "Puzzle",
        "Manhattan",
        "B",
        "Wrench",
        "Smith",
        "Nuts",
        "Brooks",
        "Blanc",
        "Suit",
        "Screwdriver",
        "Sidecar",
        "Margarita",
        "Hammer",
        "Business",
        "Gimlet",
        "Gibson"
      ],
      "gt_connections": [
        "Famous Mels",
        "Household tools",
        "Cocktails",
        "Monkey ___"
      ],
      "groups": {
        "group_1": {
          "group_id": "882c_01",
          "gt_words": [
            "Blanc",
            "Brooks",
            "B",
            "Smith"
          ],
          "gt_connection": "Famous Mels",
          "human_performance": {
            "grouping": 1,
            "connection": 1
          }
        },
        "group_2": {
          "group_id": "882c_02",
          "gt_words": [
            "Screwdriver",
            "Hammer",
            "Gimlet",
            "Wrench"
          ],
          "gt_connection": "Household tools",
          "human_performance": {
            "grouping": 1,
            "connection": 1
          }
        },
        "group_3": {
          "group_id": "882c_03",
          "gt_words": [
            "Sidecar",
            "Manhattan",
            "Gibson",
            "Margarita"
          ],
          "gt_connection": "Cocktails",
          "human_performance": {
            "grouping": 1,
            "connection": 1
          }
        },
        "group_4": {
          "group_id": "882c_04",
          "gt_words": [
            "Puzzle",
            "Business",
            "Nuts",
            "Suit"
          ],
          "gt_connection": "Monkey ___",
          "human_performance": {
            "grouping": 1,
            "connection": 1
          }
        }
      },
      "overall_human_performance": {
        "grouping": [
          1,
          1,
          1,
          1
        ],
        "connections": [
          1,
          1,
          1,
          1
        ]
      }
    }
  ]
}

where

  • "season_to_walls_map" contains the "num_walls" in each season, as well as the "start_date" and "end_date" the season ran
  • "dataset" is a list of dictionaries, where each dictionary contains all accompanying information about a wall:
    • "wall_id": a unique string identifier for the wall
    • "season": an integer representing the season the wall was collected from
    • "episode": an integer representing the episode the wall was collected from
    • "words": a list of strings representing the words in the wall in random order
    • "gt_connections": a list of strings representing the ground truth connections of each group
    • "groups: a dictionary of dictionaries containing the four groups in the wall, each has the following items:
      • "group_id": a unique string identifier for the group
      • "gt_words": a list of strings representing the ground truth words in the group
      • "gt_connection": a string representing the ground truth connection of the group
      • "human_performance: a dictionary containing recorded human performance for the grouping and connections tasks
    • "overall_human_performance": a dictionary containing recorded human performance for the grouping and connections tasks for each group in the wall

Loading the dataset

The three partitions can be loaded the same way as any other JSON file. For example, using Python:

dataset = {
    "train": json.load(open("./dataset/train.json", "r"))["dataset"],
    "validation": json.load(open("./dataset/validation.json", "r"))["dataset"],
    "test": json.load(open("./dataset/test.json", "r"))["dataset"],
}

However, it is likely easiest to work with the dataset using the HuggingFace Datasets library:

# pip install datasets
from datasets import load_dataset

dataset = load_dataset("TaatiTeam/OCW")

# The dataset can be used like any other HuggingFace dataset
# E.g. get the wall_id of the first example in the train set
dataset["train"]["wall_id"][0]
# or get the words of the first 10 examples in the test set
dataset["test"]["words"][0:10]

Note The HuggingFace copy does not include the "season_to_walls_map"

Evaluating

We provide a script for evaluating the performance of a model on the dataset. Before running, make sure you have installed the requirements and package:

pip install -r requirements.txt
pip install -e .

Then, ensure your model predictions are formatted as follows in a JSON file:

[{
    "wall_id": "882c_01",
    "predicted_groups": [
        ["Puzzle", "Manhattan", "B", "Wrench"],
        ["Smith", "Nuts", "Brooks", "Blanc"],
        ["Suit", "Screwdriver", "Sidecar", "Margarita"],
        ["Hammer", "Business", "Gimlet", "Gibson"]
    ],
    "predicted_connections": ["Famous Mels", "Household tools", "Cocktails", "Monkey ___"]
}]

Note, only one of "predicted_groups" or "predicted_connections" is required. The other can be null. In the evaluate script, predicting groups is considered "task1-grouping" and predicting connections is considered "task2-connections".

To run the evaluation script:

python src/ocw/evaluate_only_connect.py \
    --prediction-file "./predictions/task1.json" \
    --dataset-path "./dataset/" \
    --results-path "./results/" \
    --task "task1-grouping"

Downloading additional datasets for ablation studies

We also produced two "controlled" versions of the dataset, designed to reduce or entirely remove the number of red herrings, for abalation on 'easier' data:

  • A copy of the dataset where each wall in the test set is replaced with a random selection of groups. No group is repeated twice, and no wall contains two copies of the same clue. The train and validation sets are unmodified. This dataset can be downloaded from here or with a bash script:
bash download_OCW_randomized.sh
  • A copy of the dataset generated from WordNet by selecting equivalent synonyms for each clue in a group. This dataset can be downloaded from here or with a bash script:
bash download_OCW_wordnet.sh

Both of these datasets can be loaded from HuggingFace Datasets:

# pip install datasets
from datasets import load_dataset

ocw_randomized = load_dataset("TaatiTeam/OCW", "ocw_randomized")
ocw_wordnet = load_dataset("TaatiTeam/OCW", "ocw_wordnet")

Running the baselines

Word Embeddings and Pre-trained Language Models

To run word embeddings and PLM baseline:

python scripts/prediction.py \
    --model-name "intfloat/e5-base-v2" \
    --dataset-path "./dataset/" \
    --predictions-path "./predictions/" \
    --task "task1-grouping"

The model_name should be from huggingface model hub or in ['elmo', 'glove', 'crawl', 'news']. To run contextualized embeddings in PLMs, use --contextual flag.

To plot the results:

python scripts/plot.py \
    --wall-id "8cde" \
    --model-name "intfloat/e5-base-v2" \
    --shuffle-seed 9

Large Language Models

To run the few-shot in-context LLM baseline, see the run_openai.ipynb notebook. Note: this will require an OpenAI API key.

✍️ Contributing

We welcome contributions to this repository (noticed a typo? a bug?). To propose a change:

git clone https://github.com/salavina/OCW
cd OCW
git checkout -b my-branch
pip install -r requirements.txt
pip install -e .

Once your changes are made, make sure to lint and format the code (addressing any warnings or errors):

isort .
black .
flake8 .

Then, submit your change as a pull request.

πŸ“ Citing

If you use the Only Connect dataset in your work, please consider citing our paper:

@article{alavi2024large,
  title={Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset},
  author={Alavi Naeini, Saeid and Saqur, Raeid and Saeidi, Mozhgan and Giorgi, John and Taati, Babak},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

πŸ™ Acknowledgements

We would like the thank the maintainers and contributors of the fan-made and run website https://ocdb.cc/ for providing the data for this dataset. We would also like to thank the creators of the Only Connect quiz show for producing such an entertaining and thought-provoking show.

About

Code accompanying the paper "Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •