Invo-AI: Automatic E-Invoicing Using Image Processing

We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture that learns a dense representation of each candidate based on neighbouring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable in classic document processing.

Our Project: Source Code | YouTube

AUTHOR CONTRIBUTIONS

Subject:      Invo-AI
Topic:        Electronic Invoicing using Image Processing
Assignment:   Source Code (Final Presentation)
Author:       GYANENDRA DAS
Date:         July 8, 2020


Developing an Automatic Invoice Extraction system using Python

Features!

  • Extract information from scanned invoices to an XML file
  • Multilanguage Support

Our Approach

Our complete Model works in the following eight steps:

  • Convert PDF to JPG
  • Detect bboxes for all text
  • Bbox_mapper extracts contours, sorts them in all required orders, and extracts text from them
  • Recognize text using the Tesseract OCR LSTM engine
  • Ensemble keyword search to locate the table header
  • Segregate the image into Info (the non-table part) and Sheet (the item table)
  • Fill the Sheet values directly into the XLS file
  • Search keys to extract Info values and map them into the XLS file (a sketch of these last two steps follows the list)
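The last two steps above amount to writing the detected table rows and key/value pairs into a spreadsheet. A minimal sketch with openpyxl (sheet names, headers, and fields are illustrative assumptions, not the repository's actual schema) could look like this:

```python
# Minimal sketch of the XLS-filling steps using openpyxl (illustrative only;
# sheet names, headers, and fields are assumptions, not the repo's schema).
from openpyxl import Workbook

def write_invoice_xls(sheet_rows, info_fields, out_path="invoice.xlsx"):
    wb = Workbook()

    # "Sheet": the detected line-item table, written row by row.
    items = wb.active
    items.title = "Sheet"
    items.append(["Description", "Qty", "Unit Price", "Amount"])  # assumed header
    for row in sheet_rows:
        items.append(row)

    # "Info": non-table key/value pairs found by the keyword search.
    info = wb.create_sheet("Info")
    for key, value in info_fields.items():
        info.append([key, value])

    wb.save(out_path)

# Example with dummy data:
write_invoice_xls(
    sheet_rows=[["Widget A", 2, 10.0, 20.0]],
    info_fields={"Invoice No": "INV-001", "Date": "2020-07-08"},
)
```

openpyxl writes .xlsx; an .xls writer such as xlwt could be swapped in if the older format is strictly required.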

Convert PDF to JPG

For this task we use pdf2image [https://pypi.org/project/pdf2image/]
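A minimal sketch of this step (file names are placeholders; pdf2image requires poppler to be installed on the system):

```python
# Convert each PDF page to a JPG using pdf2image (requires poppler).
from pdf2image import convert_from_path

pages = convert_from_path("invoice.pdf", dpi=300)  # one PIL image per page
for i, page in enumerate(pages):
    page.save(f"invoice_page_{i}.jpg", "JPEG")
```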

Detecting Bbox for All Text

  • Binarize the images in the ExtractStructure class for image processing (a rough sketch follows)
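ExtractStructure is the repository's own class; as a rough stand-in, the binarization it relies on can be sketched with OpenCV (threshold parameters below are assumptions):

```python
# Rough sketch of binarizing a scanned page with OpenCV; an assumed
# stand-in for the ExtractStructure preprocessing, not the repo's code.
import cv2

img = cv2.imread("invoice_page_0.jpg", cv2.IMREAD_GRAYSCALE)

# Adaptive thresholding copes with uneven scan lighting; THRESH_BINARY_INV
# makes ink white on black, which suits the contour detection that follows.
binary = cv2.adaptiveThreshold(
    img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 15, 10
)
cv2.imwrite("binary.png", binary)
```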

Bbox_mapper extracts contours, sorts them in all required orders, and extracts text from them in sequential order


Optimization techniques incorporated to extract the grid structure with higher accuracy include (a rough sketch follows the list):

  • Gaussian Blur
  • getStructuringElement (to build the morphology kernel)
  • Dilate
  • Erode
  • Convolution
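These operations map directly onto OpenCV calls; a rough sketch of such a pipeline (kernel sizes are assumptions, not the repository's tuned values) is:

```python
# Sketch of the grid-structure preprocessing listed above (OpenCV);
# kernel sizes are assumptions, not the repository's tuned values.
import cv2

binary = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)

# Gaussian blur (a convolution) to suppress scanner noise before morphology.
blurred = cv2.GaussianBlur(binary, (5, 5), 0)

# Structuring elements for long horizontal and vertical table lines.
horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))

# Erode then dilate so only long horizontal/vertical strokes (the grid) survive.
horiz = cv2.dilate(cv2.erode(blurred, horiz_kernel), horiz_kernel)
vert = cv2.dilate(cv2.erode(blurred, vert_kernel), vert_kernel)

# Combine the two masks; contours of this mask outline the table cells.
grid = cv2.addWeighted(horiz, 0.5, vert, 0.5, 0)
contours, _ = cv2.findContours(grid, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
```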

This is the main task in the overall process. For this task we applied three methods:

  • Use Pytesseract [https://pypi.org/project/pytesseract/]
    • Advantage: every word is detected, producing one or more bboxes per word (see the sketch after this list)
    • Disadvantage: it cannot detect a semantic pair as one bbox; e.g. "Invoice No" and its value end up in different bboxes
  • Use EfficientDet [https://arxiv.org/abs/1911.09070]
    • Advantage: we divide the image into classes such as Shipping, Buying, Header, Footer, and Table.
      • For this task we used our own labeled dataset of around 2500 images.
      • With a good training pipeline on EfficientDet-D5 we were able to achieve a loss of 0.83.
      • Adding Weighted Boxes Fusion (WBF) [https://arxiv.org/abs/1910.13302] brought the loss down to 0.42.
    • Disadvantage: it cannot detect lines and words because of the lack of data
  • Use the CRAFT model [https://arxiv.org/pdf/1904.01941.pdf]
    • Advantage: it can detect both lines and semantic pairs by adjusting the thresholds
    • Disadvantage: none in particular, but the model still needs to be optimized
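For concreteness, the Pytesseract route (the first method above) can be sketched as follows; the drawing step is only for visual inspection, and the repository's actual wrapper may differ:

```python
# Sketch of method 1: word-level bounding boxes via pytesseract.image_to_data
# (illustrative; the repository's actual wrapper may differ).
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("invoice_page_0.jpg")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

for i, word in enumerate(data["text"]):
    if word.strip() and float(data["conf"][i]) > 0:
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imwrite("word_bboxes.jpg", img)
```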


Lastly, we used the CRAFT model for its effectiveness with less data.
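A hedged sketch of running CRAFT, assuming the community craft-text-detector package (the repository may wrap the CRAFT weights differently):

```python
# Sketch of CRAFT-based detection using the craft-text-detector package
# (an assumption -- the repository may use its own CRAFT port instead).
from craft_text_detector import Craft

craft = Craft(output_dir="craft_outputs", crop_type="poly", cuda=False)
result = craft.detect_text("invoice_page_0.jpg")
boxes = result["boxes"]  # one polygon per detected text region

# The text/link detection thresholds mentioned above can also be passed to
# Craft(...) to tune how aggressively words are grouped into semantic pairs.

# Free the detector and refiner networks when done.
craft.unload_craftnet_model()
craft.unload_refinenet_model()
```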

CAN BE IMPROVED FURTHER: by using both CRAFT and EfficientDet together, we can know which text belongs to which box and need no further processing.

Recognition of Text

Our model demonstrates higher accuracy by applying transfer learning to the Tesseract OCR LSTM engine on our annotated dataset, and it scales well to new documents with a small amount of labeled training data.

For this task we also applied two methods:

Lastly, we used Pytesseract for its effectiveness with this sample data.
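A minimal recognition sketch with Pytesseract; the language code and the custom tessdata directory for a fine-tuned model are illustrative assumptions:

```python
# Sketch of the recognition step with Pytesseract; the language code and the
# custom tessdata directory below are illustrative assumptions.
import pytesseract
from PIL import Image

crop = Image.open("cell_crop.png")  # a text region cropped by the detector

# --oem 1 selects the LSTM engine; --psm 7 treats the crop as a single line.
text = pytesseract.image_to_string(crop, lang="eng", config="--oem 1 --psm 7")

# A fine-tuned model (e.g. invo.traineddata) would be selected the same way:
# pytesseract.image_to_string(crop, lang="invo",
#                             config="--oem 1 --psm 7 --tessdata-dir ./tessdata")
print(text.strip())
```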
