
rescale and specify certain model #46

Closed

areejokaili opened this issue May 7, 2020 · 11 comments

@areejokaili

Hi,
Thank you for making your code available.
I used your score function before the last update (before multi-refs were possible and before the scorer was added). I used to record the hash of the model to make sure I always got the same results.
With the new update, I'm struggling to find out how to set a specific model and also rescale.

For example, I would like to do something like this:

out, hash_code = score(preds, golds, model_type="roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)", rescale_with_baseline=True, return_hash=True)

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0) is the hash I got from my earlier runs a couple of months ago.

Appreciate your help
Areej

@Tiiiger
Owner

Tiiiger commented May 7, 2020

Hi @areejokaili, sorry for the confusion.

The code below should meet your use case.

out, hash_code = score(preds, golds, model_type="roberta-large", rescale_with_baseline=True, return_hash=True)

@areejokaili
Author

areejokaili commented May 7, 2020


Hi @Tiiiger, thanks for the quick reply.
I tried the code you provided, but it required lang='en'.

scorer = BERTScorer(model_type='roberta-large', lang='en', rescale_with_baseline=True)

It works now, but I'm getting different scores than before. I was doing my own multi-ref scoring before, so maybe that is why.
I'll investigate more.
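
For readers following the thread, here is a minimal sketch of how the scorer object configured above can be applied to candidates and references (the example strings are made up for illustration; scorer.score returning P, R, F tensors follows the bert-score README):

# Minimal sketch: reuse one BERTScorer so the model is loaded once and the
# configuration (model_type, lang, rescaling) stays fixed across runs.
from bert_score import BERTScorer

scorer = BERTScorer(model_type="roberta-large", lang="en", rescale_with_baseline=True)

# Made-up example strings, for illustration only.
cands = ["The server handles the requests."]
refs = ["The requests are handled in the cloud."]

P, R, F = scorer.score(cands, refs)  # one score per candidate sentence
print(P.mean().item(), R.mean().item(), F.mean().item())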

@Tiiiger
Owner

Tiiiger commented May 7, 2020

Were you using baseline rescaling before? According to the hash, it seems you were not.

@areejokaili
Author

areejokaili commented May 7, 2020

This is what I used before:
score([p], [g], lang="en", verbose=False, rescale_with_baseline=True)
and this is actually the hash:
roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled

@Tiiiger
Owner

Tiiiger commented May 7, 2020

Cool, that looks correct. Let me know if you have any further questions.

@Tiiiger Tiiiger closed this as completed May 7, 2020
@areejokaili
Author

Hi @Tiiiger again,

Sorry for asking again, but I ran a dummy test to compute the similarity between 'server' and 'cloud computing' in two different environments.

The first environment has bert-score 0.3.0 and transformers 2.5.0 and gives scores 0.379, 0.209, 0.289
hash --> roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled

The second environment has bert-score 0.3.2 and transformers 2.8.0 and gives scores -0.092, -0.167, -0.128
hash --> roberta-large_L17_no-idf_version=0.3.2(hug_trans=2.8.0)-rescaled

In both cases I used the following:

(P, R, F), hash_code = score(preds, golds, lang='en', rescale_with_baseline=True, return_hash=True)

I would like to use bert-score 0.3.2 for the multi-refs feature, but I would also like to keep the same scores I got before.
I would appreciate any insight into why I'm not getting the same scores.
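
One way to narrow down where the two environments diverge is to print the installed library versions next to the returned hash. The sketch below is a debugging aid added for illustration, not part of the thread; it reuses the dummy 'server' / 'cloud computing' test above.

# Debugging sketch: report library versions together with the bert-score hash
# so runs from the two environments can be matched exactly.
import bert_score
import transformers
from bert_score import score

preds, golds = ["server"], ["cloud computing"]  # the dummy test above
(P, R, F), hash_code = score(preds, golds, lang="en",
                             rescale_with_baseline=True, return_hash=True)
print("bert-score", bert_score.__version__, "| transformers", transformers.__version__)
print(hash_code, P.item(), R.item(), F.item())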

@Tiiiger
Owner

Tiiiger commented May 8, 2020

Hi @areejokaili, thank you for letting me know. I suspect that there could be some bugs in the newer version, and I would love to fix those.

I am looking into this.

@Tiiiger Tiiiger reopened this May 8, 2020
@Tiiiger
Owner

Tiiiger commented May 8, 2020

Hi, I quickly tried a couple of environments. Here are the results:

> score(['server'], ['cloud computing'],lang='en', rescale_with_baseline=True, return_hash=True)
((tensor([-0.0919]), tensor([-0.1670]), tensor([-0.1279])),
 'roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.8.0)-rescaled')
> score(['server'], ['cloud computing'],lang='en', rescale_with_baseline=True, return_hash=True)
((tensor([0.3699]), tensor([0.2090]), tensor([0.2893])),
 'roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled')

I believe this is due to an update in the RoBERTa tokenizer.

Running transformers==2.5.0, I got this warning:

RobertaTokenizerFast has an issue when working on mask language modeling where it introduces an extra encoded space before the mask token. See https://github.com/huggingface/transformers/pull/2778 for more information.

I encourage you to check out huggingface/transformers#2778 to understand this change.

So, as I understand it, this is not a change in our software. If you want to keep the same results as before, you should downgrade to transformers==2.5.0. However, I believe the behavior in transformers==2.8.0 is more correct. It's your call, and it really depends on your use case.

Again, thank you for giving me the heads-up. I'll add a warning to our README.
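
For anyone who wants to inspect the tokenizer behavior directly, the hypothetical snippet below (not from the thread) encodes a short string and prints the token ids; running it under transformers 2.5.0 and 2.8.0 in separate environments shows whether the encoding changed. Whether the ids actually differ depends on which tokenizer class gets loaded in each version.

# Hypothetical check (not from the thread): print the token ids produced by the
# roberta-large tokenizer so the output can be compared between environments
# running transformers==2.5.0 and transformers==2.8.0.
import transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large")
print(transformers.__version__, tok.encode("cloud computing"))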

@Tiiiger Tiiiger closed this as completed May 8, 2020
@areejokaili
Author

areejokaili commented May 11, 2020

Hi @Tiiiger
Thanks for letting me know. I have updated both libraries and will go with Transformers 2.8.0.
I have one more question and would appreciate clarification on what I'm missing here:

cands = ['I like lemons.']

refs = [['I am proud of you.', 'I love lemons.', 'Go go go.']]

(P, R, F), hash_code = score(cands, refs, lang="en", rescale_with_baseline=True, return_hash=True)
P, R, F = P.mean().item(), R.mean().item(), F.mean().item()

print(">", P, R, F)
print("manual F score:", (2 * P * R / (P + R)))

--- output ---

> 0.9023454785346985 0.9023522734642029 0.9025075435638428
manual F score: 0.9023488759866588

Do you know why the F score returned by the method is different from the one I compute manually?
Thanks again

@felixgwu
Collaborator

Hi @areejokaili,

The reason is that you are using rescale_with_baseline=True.
The raw F score is computed from the raw P and R and then rescaled based on the F baseline score; P and R are each rescaled independently based on their own baseline scores. After rescaling, F is therefore no longer exactly the harmonic mean of the rescaled P and R.
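
As a numerical illustration of this point, the sketch below applies a rescaling of the form (raw - baseline) / (1 - baseline) to each metric separately. The formula and all numbers are assumptions for illustration, not real bert-score baselines, but they show why the rescaled F is not the harmonic mean of the rescaled P and R.

# Illustration only: the rescaling formula and the numbers below are assumptions,
# not real bert-score baselines.
def rescale(raw, baseline):
    return (raw - baseline) / (1 - baseline)

P_raw, R_raw = 0.95, 0.94
F_raw = 2 * P_raw * R_raw / (P_raw + R_raw)   # harmonic mean of the *raw* scores
P_base, R_base, F_base = 0.83, 0.82, 0.825    # made-up per-metric baselines

P_s = rescale(P_raw, P_base)
R_s = rescale(R_raw, R_base)
F_s = rescale(F_raw, F_base)

print("rescaled F:", F_s)
print("harmonic mean of rescaled P and R:", 2 * P_s * R_s / (P_s + R_s))
# The two values are close but not equal, matching the observation above.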

@areejokaili
Author

areejokaili commented May 11, 2020

Thanks @felixgwu
Could you check this, please?

cands = ['I like lemons.', 'cloud computing']
refs = [['I am proud of you.', 'I love lemons.', 'Go go go.'],
        ['calculate this.', 'I love lemons.', 'Go go go.']]
print("number of cands and refs are", len(cands), len(refs))
(P, R, F), hash_code = score(cands, refs, lang="en", rescale_with_baseline=False, return_hash=True)
P, R, F = P.mean().item(), R.mean().item(), F.mean().item()

print(">", P, R, F)
print("manual F score:", (2 * P * R / (P + R)))

output

> 0.9152767062187195 0.9415446519851685 0.9280155897140503
manual F score: 0.9282248763666026

Appreciate the help,
