TL;DR: Using document as reference summary in summary evaluation
Read the Background and terminology first.
To run the experiments:

- First, install the dependencies:

  ```
  pip install -r requirements.txt
  ```

- (Optionally) set the path to EvalBase, the evaluation framework used by DocAsRef, if you did not install EvalBase via pip but instead cloned it locally (see the sketch after these steps).

- Run the experiments:

  ```
  python3 experiment.py
  ```
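If EvalBase is cloned locally, one possible way to make it importable is to add the clone to the Python path before running the experiments. This is only an illustration; the path and the module name `evalbase` below are assumptions, not the repo's documented procedure.

```python
# Hypothetical way to point Python at a locally cloned EvalBase.
# The path and the module name "evalbase" are assumptions; check the repo's
# own instructions for the actual mechanism.
import sys

sys.path.insert(0, "/path/to/EvalBase")  # replace with your local clone's path

import evalbase  # should now resolve to the local clone
```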
Feel free to edit the experiment configurations in `experiment.py`, which has two sections:

- The metrics to be benchmarked
- The datasets and evaluation settings

Some metrics can be enabled or disabled by directly (un)commenting the corresponding lines in `experiment.py`. For other metrics, mostly variants of BERTScore-sentence, please (un)comment the lines for their hyperparameters, e.g., `weight_schemes = ["entropy", "sum"]` for the weighting schemes of BERTScore-sentence with PageRank-style sentence weighting. The dictionary corresponding to the metrics enabled in each approach has a name ending with the suffix `_enabled`. All enabled metrics are put together in the dictionary `all_metrics_enabled`.
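As an illustration of this naming convention, a hypothetical fragment of `experiment.py` might assemble the metric dictionaries as follows. All metric names, placeholder callables, and the factory function below are made up for illustration.

```python
# Hypothetical fragment illustrating the *_enabled naming convention;
# the lambdas stand in for real metric callables.
weight_schemes = ["entropy", "sum"]   # (un)comment entries to toggle variants

classic_metrics_enabled = {
    "bertscore": lambda docs, sums: [0.0] * len(sums),   # placeholder callable
    "rouge": lambda docs, sums: [0.0] * len(sums),        # placeholder callable
}

def make_pagerank_metric(weight_scheme):
    """Placeholder factory for a BERTScore-sentence variant with the given weighting."""
    return lambda docs, sums: [0.0] * len(sums)

pagerank_metrics_enabled = {
    f"bertscore-sentence-pagerank-{w}": make_pagerank_metric(w)
    for w in weight_schemes
}

# All enabled metrics are gathered into one dictionary.
all_metrics_enabled = {**classic_metrics_enabled, **pagerank_metrics_enabled}
```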
The code for each approach below is in its own folder. Each folder must have a `metric.py` file that defines either:

- a dictionary `metrics` that maps a string (the metric name) to a callable implementing the summary metric, or
- a function `create_metric()` that wraps base summary metrics with additional features to create new variant metrics.

Optionally, a folder may have an `eval.py` file containing the functions that define the respective metrics.
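A minimal, hypothetical `metric.py` following this convention might look like the sketch below. The metric name, the toy scoring function, and the callable signature (lists of documents and summaries in, list of scores out) are illustrative assumptions, not the repo's actual interface.

```python
# metric.py -- hypothetical example following the convention described above.
from typing import Callable, Dict, List

def dummy_overlap(documents: List[str], summaries: List[str]) -> List[float]:
    """Toy ref-free metric: word-overlap ratio between each summary and its document."""
    scores = []
    for doc, summ in zip(documents, summaries):
        doc_words, summ_words = set(doc.lower().split()), set(summ.lower().split())
        scores.append(len(doc_words & summ_words) / max(len(summ_words), 1))
    return scores

# Map metric names (strings) to callables, as expected by experiment.py.
metrics: Dict[str, Callable] = {
    "dummy-overlap": dummy_overlap,
}
```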
Metrics: BERTScore, ROUGE, BLEURT, MoverScore. Implemented in `classic/metric.py`.
Initial results show that BERTScore can be very effective after being repurposed as a ref-free metric. We propose to expand BERTScore from the token level to the sentence level:
| | BERTScore | Our changes |
|---|---|---|
| Comparison between | token pairs | sentence pairs |
| Similarity metric | cosine | NLI-based; semantically tells whether two sentences are related, and could be trained on our own tasks |
| Weighting scheme | IDF | semantic weighting |
The document is treated as a list of sentences. Memory-saving pseudocode:

```
for D in all_documents:
    [D1, D2, ...] = sentence_segmenter(D)                  # break D into sentences
    [E_1, E_2, ...] = sentence_embedder([D1, D2, ...])     # embed each sentence of D, once per document
    for S in summaries_of_D:                               # only the summaries of D, not all summaries
        [S1, S2, ...] = sentence_segmenter(S)              # break S into sentences
        [E'_1, E'_2, ...] = sentence_embedder([S1, S2, ...])  # embed each sentence of S
        score = summary_scorer([E_1, E_2, ...], [E'_1, E'_2, ...])
```
Implemented in `compute_cos()` in `bertscore_sentence/eval.py`.
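For concreteness, here is a minimal sketch of the cosine variant, assuming `sentence-transformers` with the `all-MiniLM-L6-v2` encoder; both the library and the encoder are assumptions, and the repo's `compute_cos()` may be implemented differently.

```python
# Hedged sketch of sentence-level BERTScore with cosine similarity.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not necessarily the repo's

def sentence_bertscore_f1(doc_sents, sum_sents):
    """Greedy-matching F1 over sentence embeddings, analogous to token-level BERTScore."""
    D = embedder.encode(doc_sents, normalize_embeddings=True)  # (n_doc, dim)
    S = embedder.encode(sum_sents, normalize_embeddings=True)  # (n_sum, dim)
    sim = S @ D.T                                              # pairwise cosine similarities
    recall = sim.max(axis=0).mean()      # each document sentence vs. its best summary sentence
    precision = sim.max(axis=1).mean()   # each summary sentence vs. its best document sentence
    return 2 * precision * recall / (precision + recall)
```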
Sentence similarity can also be measured via NLI probabilities. Implemented in `compute_mnli()` in `mnli/eval.py`. We send a pair of sentences (one from the document and the other from the system summary) to an NLI model, selected in `mnli/classifier.py`, which estimates three probabilities for the pair: entailment, neutral, and contradiction. The NLI models used are `roberta-large-mnli`, `facebook/bart-large-mnli`, and `microsoft/deberta-large-mnli`.
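A hedged sketch of such an NLI-based similarity, using the Hugging Face `transformers` API with one of the models listed above; the repo's `compute_mnli()` may differ, and the label order should be checked via `model.config.id2label`.

```python
# Hedged sketch of NLI-based sentence similarity; the repo's compute_mnli() may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"   # or facebook/bart-large-mnli, microsoft/deberta-large-mnli
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def nli_probs(doc_sent: str, sum_sent: str):
    """Return the model's label probabilities for the (premise, hypothesis) pair.
    Check model.config.id2label for the label order (e.g., contradiction/neutral/entailment)."""
    inputs = tokenizer(doc_sent, sum_sent, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze().tolist()
```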
Suppose a list of system summary sentences S1, S2, ... and a list of document sentences D1, D2, ..., Dn. A sentence-level metric is a function of all of them, e.g., `f1(D1, D2, ..., S1, S2, ...)`, which may be decomposed into per-sentence scores that are then aggregated, e.g., `f2( f3(D1, S1, S2, ...), f3(D2, S1, S2, ...), ..., f3(Dn, S1, S2, ...) )`. One example is an entropy-based aggregation: `entropy( sim(S1, D1), sim(S1, D2), ... ) + entropy( sim(S2, D1), sim(S2, D2), ... )`.
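A small sketch of what such an entropy-style aggregation over a sentence-similarity matrix could look like; the exact normalization and form used in the repo are assumptions.

```python
# Hedged sketch of an entropy-style aggregation over sentence similarities.
import numpy as np
from scipy.stats import entropy

def entropy_aggregate(sim):
    """sim: (n_summary_sents, n_doc_sents) similarity matrix, assumed nonnegative.
    For each summary sentence, normalize its similarities to the document sentences
    into a distribution, take the entropy, and sum over summary sentences."""
    rows = sim / sim.sum(axis=1, keepdims=True)
    return float(sum(entropy(row) for row in rows))
```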
Original BERTScore uses IDF to weight tokens. When expanding BERTScore to the sentence level, we use a PageRank-style algorithm to weight sentences.
Implemented in `pagerank/`.
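A rough sketch of PageRank-style sentence weighting over a sentence-similarity graph; the damping factor, normalization, and iteration count are assumptions, and the code in `pagerank/` may differ.

```python
# Hedged sketch: power iteration over a column-normalized sentence-similarity matrix.
import numpy as np

def pagerank_weights(sim, damping=0.85, iters=50):
    """sim: (n, n) nonnegative sentence-similarity matrix; returns one weight per sentence."""
    n = sim.shape[0]
    M = sim / sim.sum(axis=0, keepdims=True)        # column-normalize into a transition matrix
    w = np.full(n, 1.0 / n)                          # start from uniform weights
    for _ in range(iters):
        w = (1 - damping) / n + damping * (M @ w)    # standard PageRank update
    return w / w.sum()
```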
Due to the way that humans write summaries, the first few sentences in a document are more likely to be the most important ones. We use top-k (e.g., the first 3 sentences) and top-p (e.g., the first 30% of sentences) to select the first few sentences as the pseudo-reference.
Implemented in `top/`.
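For illustration, the selection itself could look like the following; the function names and default values are made up and not necessarily what `top/` uses.

```python
# Hedged sketch of top-k / top-p pseudo-reference selection.
def top_k_sentences(doc_sents, k=3):
    """Use the first k document sentences as the pseudo-reference."""
    return doc_sents[:k]

def top_p_sentences(doc_sents, p=0.3):
    """Use the first p fraction of document sentences as the pseudo-reference."""
    n = max(1, round(len(doc_sents) * p))
    return doc_sents[:n]
```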
Instead of top-k and top-p in Approach 1.5, we use the models `google/pegasus-xsum` and `facebook/bart-large-cnn` to generate pseudo-references from documents.
Implemented in `anyref/`.
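A minimal sketch of generating such a pseudo-reference with the Hugging Face `transformers` summarization pipeline; the actual code in `anyref/` may load and run the models differently.

```python
# Hedged sketch of generating a pseudo-reference with an off-the-shelf summarizer.
from transformers import pipeline

summarizer = pipeline("summarization", model="google/pegasus-xsum")  # or facebook/bart-large-cnn

def pseudo_reference(document: str) -> str:
    """Summarize the document; the output then serves as the reference for ref-based metrics.
    Long documents may exceed the model's input length and need to be shortened first."""
    return summarizer(document)[0]["summary_text"]
```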
Baselines:
- SUPERT, BLANC, SummaQA, SueNes
- BLEU, METEOR, BART, SDC*, Sentence Mover's Distance (SMD); see the `baseline/` folder
- GPT-3.5-based