- Python 3.6 or above
- numpy
- scipy
- Perl 5 or above (for SemEval evaluation)
- sklearn (for short-text classification evaluation)
- pytorch (for PoS tagging evaluation)
To install these requirements, run

pip install -r requirements.txt
To evaluate on semantic similarity benchmarks, go to the src directory and execute
python evaluate.py -m lex -i wordRepsFile -o result.csv
- -m specifies the mode of operation: 'lex' evaluates on semantic similarity benchmarks, 'ana' on word analogy benchmarks, 'rel' on relation classification benchmarks, 'txt' on short-text classification benchmarks, 'psy' on psycholinguistic score prediction benchmarks, and 'pos' on part-of-speech tagging using the CoNLL-2003 dataset. You can combine multiple evaluations with a comma; for example, -m=lex,ana,rel,txt performs all four of those evaluations in one go.
- -d specifies a directory that contains multiple word representation files to be evaluated.
- -i specifies the input file from which the word representations are read. The file must use the gensim text format: the first line contains the vocabulary size and the dimensionality as two space-separated integers, and each remaining line represents the word vector for a particular word, with the word as the first element and the vector components as the subsequent elements (see the example after this list).
- -o specifies the name of the output csv file into which the Pearson correlation coefficients and their significance values are written.
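For illustration, an input file holding 3-dimensional vectors for a two-word vocabulary would look as follows (the words and values here are made up):

```
2 3
cat 0.12 -0.34 0.56
dog 0.08 -0.21 0.47
```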
The 'lex' mode evaluates on the following semantic similarity benchmarks:

Dataset | word pairs | Publication/distribution |
---|---|---|
Word Similarity 353 (WS) | 353 | Link |
Miller-Charles (MC) | 28 | Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28. |
Rubenstein-Goodenough (RG) | 65 | Rubenstein, H. and Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633. |
MEN | 3000 | Link |
Stanford Contextual Word Similarity (SCWS) | 2003 | Link |
Rare Words (RW) | 2034 | Link |
SimLex-999 | 999 | Link |
MTURK-771 | 771 | Link |
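On these benchmarks, the tool computes a similarity score for each word pair from the embeddings and correlates it with the human ratings. The sketch below, using numpy and scipy from requirements.txt, shows the standard cosine-similarity protocol; the function names and the benchmark format are illustrative, not necessarily those used in evaluate.py:

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(embeddings, benchmark):
    # embeddings: dict mapping a word to its numpy vector.
    # benchmark: list of (word1, word2, human_rating) triples, e.g. from WS.
    predicted, gold = [], []
    for w1, w2, rating in benchmark:
        if w1 in embeddings and w2 in embeddings:  # skip out-of-vocabulary pairs
            predicted.append(cosine(embeddings[w1], embeddings[w2]))
            gold.append(rating)
    return pearsonr(predicted, gold)  # (correlation coefficient, p-value)
```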
The 'ana' mode evaluates on the following word analogy benchmarks:

Dataset | instances | Publication/distribution |
---|---|---|
SAT | 374 questions | Link |
SemEval 2012 Task 2 | 79 paradigms | Link |
Google dataset | 19558 questions (syntactic + semantic analogies) | Link |
MSR dataset | 7999 syntactic questions | Link |
There are several ways to compute the relational similarity between two pairs of words, such as CosAdd, CosMult, PairDiff, and CosSub. This tool uses CosAdd as the default method. The other methods are also implemented; see the source code for more details.
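For reference, the sketch below shows how CosAdd and two of the alternatives score a candidate d for the analogy a:b :: c:d; the function names are illustrative and the exact implementation in the tool may differ:

```python
import numpy as np

def _unit(v):
    # Normalise a vector to unit length.
    return v / np.linalg.norm(v)

def cos_add(a, b, c, d):
    # CosAdd: cos(b - a + c, d); the best answer maximises this score.
    return np.dot(_unit(b - a + c), _unit(d))

def pair_diff(a, b, c, d):
    # PairDiff: cosine similarity between the two offset vectors.
    return np.dot(_unit(b - a), _unit(d - c))

def cos_mult(a, b, c, d, eps=1e-8):
    # CosMult: multiplicative combination, with similarities shifted
    # from [-1, 1] to [0, 1] before combining.
    sim = lambda u, v: (np.dot(_unit(u), _unit(v)) + 1) / 2
    return sim(d, b) * sim(d, c) / (sim(d, a) + eps)
```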
The 'rel' mode evaluates on the following relation classification benchmark:

Dataset | word pairs | Publication/distribution |
---|---|---|
DiffVec | 12473 pairs | Link |
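A common recipe for this benchmark, sketched below with illustrative names (the classifier actually used by the tool may differ), is to represent each word pair by the offset of its two embeddings and train a standard classifier over the relation labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pair_features(pairs, embeddings):
    # Represent each (word1, word2) pair by its vector offset (PairDiff).
    return np.array([embeddings[w1] - embeddings[w2] for w1, w2 in pairs])

def relation_classification_accuracy(pairs, labels, embeddings):
    # labels: the relation type annotated for each pair in DiffVec.
    X = pair_features(pairs, embeddings)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    return scores.mean()
```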
The 'txt' mode evaluates on the following short-text classification benchmarks:

Dataset | train/test instances | Publication/distribution |
---|---|---|
TR (Stanford Sentiment Treebank) | train = 6001, test = 1821 | Link |
MR (Movie Review Dataset) | train = 8530, test = 2132 | Link |
CR (Customer Review Dataset) | train = 1196, test = 298 | Link |
SUBJ (Subjectivity Dataset) | train = 8000, test = 2000 | Link |
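A typical protocol for these benchmarks, sketched below with illustrative names (the exact classifier used by the tool may differ), represents each text by the average of its word embeddings, trains a classifier on the train split, and reports accuracy on the test split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def text_vector(tokens, embeddings, dim):
    # Average the embeddings of in-vocabulary tokens; zero vector if none found.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def classification_accuracy(train_texts, y_train, test_texts, y_test, embeddings, dim):
    X_train = np.array([text_vector(t, embeddings, dim) for t in train_texts])
    X_test = np.array([text_vector(t, embeddings, dim) for t in test_texts])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)  # accuracy on the test split
```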
We feed the input word embeddings to a neural network (containing a single hidden layer of 100 neurons with ReLU activation) to learn a regression model (no activation in the output layer). We use a randomly selected 80% of the words from the MRC database and the ANEW dataset to train a regression model for valence, arousal, dominance, concreteness and imageability. We then measure the Pearson correlation between the predicted ratings and the human ratings and report the corresponding correlation coefficients. See Section 4.2 of this paper for further details on this setting.
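The described setup maps directly onto sklearn's MLPRegressor, as sketched below; the function name and data layout are illustrative:

```python
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def rating_correlation(X, y, seed=0):
    # X: (n_words, dim) matrix of word embeddings; y: (n_words,) human
    # ratings for one property, e.g. valence from ANEW or imageability from MRC.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=seed)
    # One hidden layer of 100 ReLU units; MLPRegressor's output layer is linear.
    model = MLPRegressor(hidden_layer_sizes=(100,), activation='relu', max_iter=1000)
    model.fit(X_tr, y_tr)
    return pearsonr(model.predict(X_te), y_te)  # (correlation, p-value)
```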
pos.py can be used to evaluate pretrained word embeddings for Part-of-Speech (PoS) tagging on the CoNLL-2003 dataset. Specifically, we train an LSTM initialised with the pretrained word embeddings, followed by a hidden layer (100 dimensions by default) and a softmax layer that assigns each word one of the 47 PoS tags. The LSTM is trained on the standard train split of CoNLL-2003 and evaluated on its standard test split. We report accuracy (the fraction of tokens whose PoS tags are predicted correctly) together with macro-averaged precision, recall and F scores over the 47 PoS categories.
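A minimal PyTorch sketch of this architecture follows; the class name and hyperparameter defaults are illustrative, and the exact implementation in pos.py may differ:

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, pretrained, lstm_dim=100, hidden_dim=100, n_tags=47):
        super().__init__()
        # Initialise the embedding layer with the pretrained word vectors.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.lstm = nn.LSTM(pretrained.size(1), lstm_dim, batch_first=True)
        self.hidden = nn.Linear(lstm_dim, hidden_dim)  # 100-dimensional hidden layer
        self.out = nn.Linear(hidden_dim, n_tags)       # one score per PoS tag

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices.
        h, _ = self.lstm(self.embed(token_ids))
        # Returns per-token logits; the softmax is applied implicitly when
        # training with nn.CrossEntropyLoss.
        return self.out(torch.relu(self.hidden(h)))
```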