In our paper, we formalize the linear representation hypothesis for large language models and define a causal inner product that respects the semantic structure of language. We confirm the theory with LLaMA-2 representations, and this repo provides the code for those experiments.
In `word_pairs`, each `[___ - ___].txt` file has counterfactual pairs of words for each concept. We use them to estimate the unembedding representations.
In `paired_contexts`, each `__-__.jsonl` file has context samples from Wikipedia in different languages. We use them for the measurement experiment (`3_measurement.ipynb`).
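The exact JSON layout is not documented here; a minimal sketch of reading one of these files, with a hypothetical file name:

```python
import json

# Minimal sketch: read one paired-context file. The actual file names and
# JSON keys under paired_contexts/ may differ.
contexts = []
with open("paired_contexts/french-english.jsonl") as f:  # hypothetical file name
    for line in f:
        contexts.append(json.loads(line))  # each line is one JSON record of paired contexts

print(len(contexts), "paired contexts loaded")
```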
You need to install the Python packages `transformers`, `torch`, `numpy`, `seaborn`, `matplotlib`, and `tqdm` to run the code (for example, `pip install transformers torch numpy seaborn matplotlib tqdm`); the `json` module is part of the Python standard library. You also need one or more GPUs to run the code efficiently.
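A quick, optional check that the environment is ready (a minimal sketch using only the packages listed above):

```python
import torch
import transformers

# Quick environment check before running store_matrices.py and the notebooks.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```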
Make a `matrices` directory and run `store_matrices.py` first, before you run the Jupyter notebooks.
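The script's exact contents are not described in this README; the sketch below only illustrates the kind of preprocessing involved, namely extracting the LLaMA-2 unembedding matrix and caching it, together with a whitening matrix, under `matrices/`. The model id, the output file names, and the use of the inverse unembedding covariance for the causal inner product are assumptions.

```python
import os
import numpy as np
import torch
from transformers import AutoModelForCausalLM

# Hypothetical preprocessing sketch; the real store_matrices.py may differ.
model_name = "meta-llama/Llama-2-7b-hf"  # assumed model id
os.makedirs("matrices", exist_ok=True)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
gamma = model.get_output_embeddings().weight.detach().float().numpy()  # (vocab_size, dim) unembedding matrix

# Center the unembedding vectors and compute their covariance, assumed here to
# define the causal inner product <x, y> = x^T Cov(gamma)^{-1} y.
gamma_centered = gamma - gamma.mean(axis=0, keepdims=True)
cov = gamma_centered.T @ gamma_centered / gamma_centered.shape[0]

# Whitening matrix Cov(gamma)^{-1/2} via an eigendecomposition (clipped for
# numerical stability).
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = np.clip(eigvals, 1e-8, None)
inv_sqrt_cov = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

np.save("matrices/gamma.npy", gamma)                # hypothetical file name
np.save("matrices/inv_sqrt_cov.npy", inv_sqrt_cov)  # hypothetical file name
```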
- `1_subspace.ipynb`: We compare the projection of differences between counterfactual pairs (vs. random pairs) onto their corresponding concept direction.
- `2_heatmap.ipynb`: We compare the orthogonality between the unembedding representations for causally separable concepts based on the causal inner product (vs. the Euclidean and a random inner product); a sketch of this comparison follows the list.
- `3_measurement.ipynb`: We confirm that the concept direction acts as a linear probe.
- `4_intervention.ipynb`: We confirm that intervening with the embedding representation changes the target concept without changing off-target concepts.
- `5_sanity_check.ipynb`: We verify that the causal inner product we find satisfies the uncorrelatedness condition in Assumption 3.3.
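As referenced above, here is a rough sketch of comparing two concept directions under the Euclidean and the causal inner product. It assumes the causal inner product takes the form x^T Cov(gamma)^{-1} y with Cov(gamma) the covariance of the unembedding vectors, and the cached file names are hypothetical.

```python
import numpy as np

# Sketch: compare two concept directions under the Euclidean inner product and
# the (assumed) causal inner product <x, y> = x^T Cov(gamma)^{-1} y.
gamma = np.load("matrices/gamma.npy")  # hypothetical cached unembedding matrix
gamma_centered = gamma - gamma.mean(axis=0, keepdims=True)
cov = gamma_centered.T @ gamma_centered / gamma_centered.shape[0]
cov_inv = np.linalg.inv(cov)

def cosine(u, v, M=None):
    # Cosine similarity under <u, v>_M = u^T M v (Euclidean if M is None).
    if M is None:
        M = np.eye(u.shape[0])
    return (u @ M @ v) / np.sqrt((u @ M @ u) * (v @ M @ v))

# Unembedding directions of two causally separable concepts, estimated as in
# the word_pairs sketch above (file names hypothetical).
dir_a = np.load("matrices/direction_a.npy")
dir_b = np.load("matrices/direction_b.npy")
print("Euclidean cosine:", cosine(dir_a, dir_b))
print("Causal cosine:", cosine(dir_a, dir_b, cov_inv))
```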