Invoice Extraction #1

Closed
vibeeshan025 opened this issue Mar 27, 2022 · 4 comments

Comments

@vibeeshan025

First of all, great work! I appreciate the novel concept of covering multiple languages at fine-tuning time.

Do you think lilt-infoxlm-base is sufficient as a base for training a model to extract basic information from invoices, or would a completely fresh pre-trained model built from around 1 million samples be required?

How long did it take you to create the pre-trained model, and what hardware was used?

For fine-tuning, how many annotated invoices do you think are required (is around 5,000 sufficient)? How long would the fine-tuned model need to be trained, and on what hardware?

Thanks in advance. Additionally, I have access to a lot of invoices; if this is successful, I can share the final model here.

jpWang (Owner) commented Mar 27, 2022

Thanks for your interest in our work.

Q1: Are these invoices monolingual or multilingual? For a resource-rich language such as English, LiLT+English-RoBERTa often performs better than LiLT+InfoXLM. Furthermore, you don't need to train a completely fresh model from scratch: if your invoices are in English, you can load the pre-trained LiLT+English-RoBERTa weights and continue pre-training them on your 1 million unlabeled samples for a while.
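
For reference, a minimal sketch of loading such a checkpoint and running a forward pass is shown below. It assumes the LiLT integration in recent versions of Hugging Face transformers and the hub checkpoint name SCUT-DLVCLab/lilt-roberta-en-base; the words and box coordinates are made-up placeholders, and the repo's own scripts load the weights differently.

```python
# Minimal sketch (not the repo's official scripts): load LiLT+English-RoBERTa
# and run a forward pass. Checkpoint name and box values are assumptions.
import torch
from transformers import AutoTokenizer, LiltModel

checkpoint = "SCUT-DLVCLab/lilt-roberta-en-base"  # assumed hub name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LiltModel.from_pretrained(checkpoint)

# LiLT expects token ids plus one bounding box per token, normalized to 0-1000.
words = ["Invoice", "No.", "12345"]
word_boxes = [[64, 48, 156, 62], [160, 48, 182, 62], [186, 48, 250, 62]]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Expand word-level boxes to token level; special tokens get a dummy box.
enc["bbox"] = torch.tensor([[word_boxes[i] if i is not None else [0, 0, 0, 0]
                             for i in enc.word_ids()]])

with torch.no_grad():
    out = model(**enc)
print(out.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```

Continued pre-training on your unlabeled invoices would start from this same checkpoint rather than from a random initialization.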

Q2: Less than a week for the experimental setup described in our paper.

Q3: You can refer to the layout diversity (i.e., task difficulty), the number of samples, and the SOTA performance on public academic datasets such as FUNSD, CORD, SROIE, EPHOIE, and XFUND. Generally speaking, compared with these datasets, 5,000 is already a relatively sufficient number. You can also refer to the fine-tuning strategies we provide and the experimental setup described in our paper.
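
To make the fine-tuning side concrete, here is a hedged, self-contained sketch of BIO-style invoice field tagging with LiltForTokenClassification. The label set, checkpoint name, learning rate, and the single toy example are assumptions, not values from the paper; a real run would iterate over your ~5,000 annotated invoices for several epochs.

```python
# Hedged sketch of fine-tuning LiLT for invoice field tagging (token
# classification). Labels, checkpoint name, and the toy example are placeholders.
import torch
from transformers import AutoTokenizer, LiltForTokenClassification

labels = ["O", "B-INVOICE_NO", "I-INVOICE_NO", "B-TOTAL", "I-TOTAL"]
checkpoint = "SCUT-DLVCLab/lilt-roberta-en-base"  # assumed hub name

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LiltForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

# One toy annotated line; in practice, loop over your annotated invoices.
words = ["Invoice", "No", "12345", "Total", "99.00"]
word_boxes = [[50, 50, 148, 70], [152, 50, 190, 70], [194, 50, 280, 70],
              [50, 700, 120, 720], [128, 700, 200, 720]]
word_labels = [1, 2, 2, 3, 4]  # indices into `labels`

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
word_ids = enc.word_ids()
enc["bbox"] = torch.tensor([[word_boxes[i] if i is not None else [0, 0, 0, 0]
                             for i in word_ids]])
token_labels = torch.tensor([[word_labels[i] if i is not None else -100
                              for i in word_ids]])  # -100 = ignored by the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for step in range(3):  # a few toy steps; real fine-tuning runs for epochs
    out = model(**enc, labels=token_labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(step, out.loss.item())
```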

@vibeeshan025 (Author)

Thanks a lot. The invoices are currently monolingual but may be extended to other languages later; I believe the LiLT+English-RoBERTa approach is well suited to my specific task. Is it really necessary to pre-train again on my own data to get better results, or is fine-tuning alone sufficient?

I want to see a few results first, before spending time and money on pre-training. That's why I am asking.

jpWang (Owner) commented Mar 27, 2022

Generally, fine-tuning alone can achieve a satisfactory result. But when you want to make use of the unlabeled "in-domain" samples, or you want to push performance further, you can try the strategy of continuing pre-training.
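
For a concrete (and deliberately simplified) picture of that option, the sketch below continues pre-training with plain text-only masked-language modeling over unlabeled in-domain samples. The paper's actual pre-training uses additional layout-related objectives on top of masked visual-language modeling, so treat this only as a minimal stand-in; the checkpoint name, masking rate, and toy sample are assumptions.

```python
# Simplified continued pre-training sketch: text-only masked-language modeling
# over LiltModel with a fresh LM head. NOT the paper's full pre-training recipe;
# checkpoint name and sample are assumptions.
import torch
from torch import nn
from transformers import AutoTokenizer, LiltModel

checkpoint = "SCUT-DLVCLab/lilt-roberta-en-base"  # assumed hub name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
backbone = LiltModel.from_pretrained(checkpoint)
lm_head = nn.Linear(backbone.config.hidden_size, backbone.config.vocab_size)

# One unlabeled "invoice" line; in practice, stream your ~1M OCR'd documents.
words = ["Invoice", "No", "12345"]
word_boxes = [[50, 50, 148, 70], [152, 50, 190, 70], [194, 50, 280, 70]]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
enc["bbox"] = torch.tensor([[word_boxes[i] if i is not None else [0, 0, 0, 0]
                             for i in enc.word_ids()]])

# Mask ~15% of non-special tokens and predict them from the text+layout context.
labels = enc["input_ids"].clone()
special = torch.tensor(tokenizer.get_special_tokens_mask(
    enc["input_ids"][0].tolist(), already_has_special_tokens=True)).bool()
mask = (torch.rand(labels.shape) < 0.15) & ~special
mask[0, 1] = True                   # guarantee at least one masked token in this toy example
labels[~mask] = -100                # only masked positions contribute to the loss
enc["input_ids"][mask] = tokenizer.mask_token_id

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(lm_head.parameters()), lr=1e-5)
logits = lm_head(backbone(**enc).last_hidden_state)
loss = nn.CrossEntropyLoss()(logits.view(-1, logits.size(-1)), labels.view(-1))
loss.backward()
optimizer.step()
print(loss.item())
```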

@Bhageshwarsingh

@vincentAGNES Hi, I recently came across your project. I have a few doubts, and it would be very helpful if you could make some time to help me out.

jpWang closed this as completed on Jun 10, 2022