
LayoutLM not classifying bottom half of documents #935

Open · DoubtfulCoder opened this issue Nov 30, 2022 · 5 comments

Comments

@DoubtfulCoder

Describe the bug
Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutLM

I am trying to use LayoutLM for resume parsing. I've labeled and trained on over 100 resumes and am currently reaching an F1 score of around 0.55 and accuracy of around 85%. However, when I run inference, many of the documents have large portions of text (clustered at the bottom) left unclassified. The resumes I've run inference on are in a similar format to those trained on and should have similar bounding-box locations. Why is LayoutLM not classifying them? If it's overfitting, what can I do about it?

Example (blurred for personal info): [screenshot of a resume page with the bottom portion of the text left unclassified]

@wolfshow (Contributor)

@DoubtfulCoder, that's not overfitting. The reason is that LayoutLM processes the document in a window of 512 tokens. If your document is longer than 512 tokens, you need to split the page into multiple samples for the model to process, for both training and testing.
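A minimal sketch of that splitting step, assuming `words` and `boxes` are the parallel word and bounding-box lists produced by your OCR step (hypothetical names, not code from this repo):

```python
from transformers import LayoutLMTokenizerFast

tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")

def split_page(words, boxes, max_len=512):
    """Yield (words, boxes) chunks whose tokenized length fits within max_len."""
    chunk_words, chunk_boxes, used = [], [], 2  # reserve 2 slots for [CLS] and [SEP]
    for word, box in zip(words, boxes):
        n = len(tokenizer.tokenize(word))  # one word may become several subword tokens
        if used + n > max_len and chunk_words:
            yield chunk_words, chunk_boxes
            chunk_words, chunk_boxes, used = [], [], 2
        chunk_words.append(word)
        chunk_boxes.append(box)
        used += n
    if chunk_words:
        yield chunk_words, chunk_boxes
```

Each yielded chunk is then encoded and passed to the model as its own sample; at inference time, the per-chunk predictions are stitched back together in order.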

@DoubtfulCoder (Author) commented Dec 1, 2022

Thanks for your help @wolfshow. By tokens, do you mean just words, or do they refer to something else as well?

How can I handle this length limit in training and in inference? I've seen mentions of a sliding-window approach that moves 128 tokens at a time. Can you provide some example code?
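(As an aside, "tokens" here are WordPiece subword pieces rather than whole words, so a 512-token budget can run out well before 512 words. A quick way to see this with the LayoutLM tokenizer:)

```python
from transformers import LayoutLMTokenizerFast

tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")

# A single rare word can split into multiple subword tokens.
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']
```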

@davelza95 commented Dec 7, 2022

I have a similar issue. I tried changing seq_max_length, but I got a CUDA error during training.

I also tried setting max_position_embeddings = 1024 and resizing the bboxes to (1024 + 196 + 1, 4) after tokenizing them, but this hasn't worked.

Note: why 196 + 1?

Can someone help me, please?
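For anyone pursuing the longer-position-embeddings route, a hedged sketch of why the plain config change tends to fail: the pretrained checkpoint only ships 512 position-embedding rows, so positions beyond 511 either cause a shape mismatch at load time or out-of-range embedding lookups, which surface as CUDA errors during training. Loading with ignore_mismatched_sizes=True avoids the crash, but the extra positions start out untrained:

```python
from transformers import LayoutLMConfig, LayoutLMForTokenClassification

config = LayoutLMConfig.from_pretrained(
    "microsoft/layoutlm-base-uncased", max_position_embeddings=1024
)
# The checkpoint's position-embedding matrix is (512, hidden); asking for 1024
# positions means the extra 512 rows are freshly initialized, not pretrained.
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    config=config,
    ignore_mismatched_sizes=True,
)
```

Because those new rows carry no pretrained signal, the sliding-window approach discussed below is usually the safer option.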

@davelza95 commented

> Thanks for your help @wolfshow. By tokens, do you mean just words, or do they refer to something else as well?
>
> How can I handle this length limit in training and in inference? I've seen mentions of a sliding-window approach that moves 128 tokens at a time. Can you provide some example code?

Hi! Did you fix it?

@DoubtfulCoder (Author) commented

> Hi! Did you fix it?

Hi, I did not try increasing max_position_embeddings; I just used a sliding-window approach. Basically, if the number of words is greater than about 315, slide 300-word windows with a stride of 100 words (0-300, 100-400, etc.) and then aggregate the predictions.
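A minimal sketch of that aggregation, assuming a classify_words(words, boxes) helper (hypothetical) that runs LayoutLM on a single window and returns one predicted label per word:

```python
from collections import Counter, defaultdict

WINDOW, STRIDE = 300, 100  # word windows: 0-300, 100-400, ... as described above

def sliding_window_predict(words, boxes, classify_words):
    """Run overlapping windows over the page and majority-vote per word."""
    votes = defaultdict(Counter)  # word index -> Counter of predicted labels
    for start in range(0, max(1, len(words) - WINDOW + STRIDE), STRIDE):
        window_words = words[start:start + WINDOW]
        window_boxes = boxes[start:start + WINDOW]
        for offset, label in enumerate(classify_words(window_words, window_boxes)):
            votes[start + offset][label] += 1
    # each word is covered by up to three overlapping windows; keep the most common label
    return [votes[i].most_common(1)[0][0] for i in range(len(words))]
```

The overlap means most words get predicted two or three times, and the majority vote smooths out disagreements at window boundaries.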
