
Incorrect generation of training data in GPL training #2751

Closed

aditya-malte opened this issue Jul 1, 2022 · 5 comments

aditya-malte commented Jul 1, 2022

Describe the bug
Hi,
I followed this tutorial to train my own GPL model.
On closer observation, I noticed two things:

  1. "pos" and "neg" are sometimes switched; this is especially evident when the (margin) score is negative. Also, why is the score negative? Shouldn't it always (or at least mostly) be positive, since CE(query, pos) > CE(query, neg) the vast majority of the time? (See the margin sketch after this list.)
  2. The generated questions are sometimes completely incorrect, in the sense that they clearly appear to have been generated from one of the many documents but match neither the neg nor the pos passage.
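For reference on point 1: GPL's margin score is just the difference between two cross-encoder relevance scores, so a negative value means the cross-encoder ranked the supposed negative above the supposed positive. A minimal sketch of the computation, assuming the cross-encoder from the tutorial (cross-encoder/ms-marco-MiniLM-L-6-v2):

```python
from sentence_transformers import CrossEncoder

# The tutorial's default cross-encoder (an assumption; swap in whichever
# model your PseudoLabelGenerator actually used).
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def margin_score(question: str, pos_doc: str, neg_doc: str) -> float:
    """GPL margin = CE(question, pos_doc) - CE(question, neg_doc).
    Positive when the cross-encoder agrees that pos_doc is the better passage."""
    pos_score, neg_score = ce.predict([(question, pos_doc), (question, neg_doc)])
    return float(pos_score - neg_score)
```

If I understand GPL correctly, occasional negative margins are expected by design: the mined hard negative can genuinely be more relevant to the generated question than the passage the question was generated from, and the MarginMSE loss consumes the signed margin as-is. Systematically negative scores, though, would point at swapped pos/neg fields.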

Expected behavior
"pos" and "neg" to not be switched at some places
AND
labels to be more accurate


To Reproduce
Steps to reproduce the behavior

FAQ Check

System:

  • OS: Ubuntu
  • GPU/CPU: A6000
  • Haystack version (commit or version number): 1.5.1rc0
  • DocumentStore: Elastic
  • Reader: ..
  • Retriever: EmbeddingRetriever("sentence-transformers/msmarco-distilbert-base-tas-b")
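To make the reproduction steps above concrete, here is a minimal sketch of the setup, following the Haystack GPL tutorial (the model names and defaults below are the tutorial's, not necessarily my exact notebook):

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever, PseudoLabelGenerator, QuestionGenerator

# Assumes a local Elasticsearch instance with documents already indexed.
document_store = ElasticsearchDocumentStore()

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",
    model_format="sentence_transformers",
)
document_store.update_embeddings(retriever)

# Question generator as in the tutorial (a doc2query model).
question_generator = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1")

# PseudoLabelGenerator generates questions, mines hard negatives with the
# retriever, and scores (question, pos, neg) triples with a cross-encoder.
psg = PseudoLabelGenerator(question_generator, retriever)
output, _ = psg.run(documents=document_store.get_all_documents())

# Each label holds a question, a positive passage, a mined negative passage,
# and a margin score; these labels are what look wrong to me.
gpl_labels = output["gpl_labels"]
retriever.train(gpl_labels)  # GPL training step from the tutorial
```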
aditya-malte (Author) commented

I have a feeling that the pseudo_label_generator.py file in Haystack may have issues when generating the training data.
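One way to check this independently would be to re-score a sample of the generated triples with a standalone cross-encoder and count how often the "pos" passage loses to the "neg" passage. A sketch, assuming the labels carry the question/pos_doc/neg_doc keys that the tutorial's output shows:

```python
from typing import Dict, List

from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def count_swapped(gpl_labels: List[Dict]) -> int:
    """Count labels where the cross-encoder scores neg_doc above pos_doc."""
    swapped = 0
    for label in gpl_labels:
        pos_score, neg_score = ce.predict(
            [(label["question"], label["pos_doc"]),
             (label["question"], label["neg_doc"])]
        )
        if neg_score > pos_score:
            swapped += 1
    return swapped
```

If that count is high, the pos/neg assignment would be the likely culprit rather than the scoring itself.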

julian-risch (Member) commented
Hi @aditya-malte, we compared the results generated by Haystack's implementation of GPL with those generated by the reference implementation and didn't find any differences. Which models did you use for the question generator and the cross-encoder: the ones from the tutorial, or did you change them? Did you make any other changes to the tutorial, for example, did you use different data? Maybe @vblagoje can help here?

vblagoje self-assigned this Jul 4, 2022
vblagoje (Member) commented Jul 4, 2022

@aditya-malte, thanks for your report. The questions Julian posted are what I would have asked. But maybe it would be simpler if you shared your notebook so we can take a look?

vblagoje (Member) commented

Ping @aditya-malte, any updates? Have you noticed these issues in the GPL tutorial? Would love to hear back from you on this one.

vblagoje (Member) commented Sep 1, 2022

I am closing this issue due to a lack of response from the reporter. We'll reopen it if a unit test or other clear proof reveals an issue with GPL.

vblagoje closed this as completed Sep 1, 2022