ValueError: Expected input batch_size (2400) to match target batch_size (2304) #496

Open
hudaoling opened this issue May 28, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@hudaoling

image
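I generated the data in the format provided, but training fails with this error. I debugged it and the shapes simply do not match, but I cannot find which specific output is at fault.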

@hudaoling hudaoling added the bug Something isn't working label May 28, 2024
@hudaoling
Author

image
Most likely the shape of the output is wrong, but I don't know how to fix it.

@shibing624
Owner

This is a dataset problem; the data needs to be cleaned. Try a test run with the first 200 samples.
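A minimal sketch of that cleaning step, in case it helps. It assumes each record is a dict with `original_text`, `correct_text`, and `wrong_ids` fields; those field names and the helper `clean_pairs` are assumptions for illustration, so adjust them to your own files. It drops pairs whose source and target differ in character length (one plausible cause of the input/target size mismatch in the error) and keeps at most the first 200 valid records:

```python
def clean_pairs(records, limit=None):
    """Keep records whose source and target have the same character length."""
    kept = []
    for rec in records:
        src = rec.get("original_text", "")
        tgt = rec.get("correct_text", "")
        # Skip empty sentences and pairs that are not aligned character by character.
        if not src or len(src) != len(tgt):
            continue
        # Skip records whose wrong_ids point outside the sentence.
        if any(i < 0 or i >= len(src) for i in rec.get("wrong_ids", [])):
            continue
        kept.append(rec)
        if limit is not None and len(kept) >= limit:
            break
    return kept


# Toy data: the second pair is one character short, so it is dropped.
records = [
    {"original_text": "今天天汽很好", "wrong_ids": [3], "correct_text": "今天天气很好"},
    {"original_text": "我喜欢吃苹", "wrong_ids": [], "correct_text": "我喜欢吃苹果"},
]
subset = clean_pairs(records, limit=200)
print(len(subset), "of", len(records), "records kept")
```

If the error disappears on the cleaned subset, the dropped records are the ones to inspect.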

@hudaoling
Author

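I looked through earlier issues, and apparently macbert only supports aligned text. Does that mean sentence pairs of different lengths are not supported?
If the lengths differ, is there somewhere I could modify the code to support them?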

@hudaoling
Author

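Also, my use case is a mixed Chinese-English corpus in which the correct and erroneous sentences differ in length, and I have struggled for a long time with how to handle it.
Because the text contains English words, the error-word ids cannot be aligned with the tokens produced by the tokenizer, which is a real headache.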
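To illustrate the alignment problem described here, below is a rough sketch with a toy tokenizer. It is not the tokenizer macbert uses: it simply treats each run of ASCII letters as one token and every other non-space character as its own token, which is an assumption made only to show how character indices and token indices drift apart once English words appear. The helper name `char_to_token_index` is hypothetical.

```python
import re


def char_to_token_index(text):
    """Map each non-space character position to the index of its toy token."""
    mapping = {}
    # Toy rule: a run of ASCII letters is one token, any other non-space char is one token.
    for tok_idx, match in enumerate(re.finditer(r"[A-Za-z]+|\S", text)):
        for char_idx in range(match.start(), match.end()):
            mapping[char_idx] = tok_idx
    return mapping


text = "我用iphone打电话"
mapping = char_to_token_index(text)

# Suppose the annotation marks character 9 ("电") as wrong.
char_wrong_ids = [9]
# After tokenization the same character sits at a different position.
token_wrong_ids = sorted({mapping[i] for i in char_wrong_ids})
print(char_wrong_ids, "->", token_wrong_ids)
```

With a real subword tokenizer the mapping would have to come from the tokenizer's own offset information rather than a regex, but the remapping idea is the same.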

@shibing624
Owner

Use a T5 model or a large language model (e.g. YI).

@hudaoling
Author

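I downloaded the T5 training samples for reference, and they are also aligned pairs of correct and erroneous sentences.
How do I handle sentences that are not aligned? Is there anything to watch out for when annotating wrong_ids?
The image below shows an example of unaligned sentences: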
image
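For pairs that are aligned, one way to avoid annotation mistakes is to derive wrong_ids from the two sentences instead of writing them by hand. The sketch below assumes wrong_ids are 0-based character positions where the two sentences differ; that convention and the helper name `derive_wrong_ids` are assumptions, so check them against your data. It deliberately refuses unequal-length pairs, since positions are not well defined there:

```python
def derive_wrong_ids(original_text, correct_text):
    """Return 0-based character positions where two equal-length sentences differ."""
    if len(original_text) != len(correct_text):
        raise ValueError(
            f"length mismatch: {len(original_text)} vs {len(correct_text)}"
        )
    return [
        i
        for i, (wrong_char, right_char) in enumerate(zip(original_text, correct_text))
        if wrong_char != right_char
    ]


ids = derive_wrong_ids("今天天汽很好", "今天天气很好")
print(ids)
```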

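Also, I read the T5 training code, and it does not seem to use wrong_ids at all; it just does text-to-text generation directly, right?
After fine-tuning T5 on 1,000 samples, even samples that were part of the training set do not give the expected result when I run correction on them.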

@shibing624
Owner

1. A training set with unequal-length pairs has no wrong ids; 2. Debug more and train more.
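As a rough sketch of point 1: for text-to-text training only the source and target sentences are needed, so unequal-length pairs can be kept and wrong_ids simply ignored. The field names `original_text` and `correct_text` and the helper `to_seq2seq_pairs` are assumptions for illustration; the exact input format expected by the T5 training script is not shown in this thread.

```python
def to_seq2seq_pairs(records):
    """Turn records into (source, target) pairs, ignoring wrong_ids entirely."""
    pairs = []
    for rec in records:
        src = rec.get("original_text", "").strip()
        tgt = rec.get("correct_text", "").strip()
        # Lengths may differ; only empty sentences are dropped.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


records = [
    {"original_text": "今天天汽很好", "wrong_ids": [3], "correct_text": "今天天气很好"},
    {"original_text": "我喜欢吃苹", "correct_text": "我喜欢吃苹果"},
    {"original_text": "", "correct_text": "空"},
]
pairs = to_seq2seq_pairs(records)
for src, tgt in pairs:
    print(src, "->", tgt)
```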
