US20090116736A1 - Systems and methods to automatically classify electronic documents using extracted image and text features and using a machine learning subsystem - Google Patents
- Publication number
- US20090116736A1 (application US 12/266,462)
- Authority
- US
- United States
- Prior art keywords
- document
- documents
- job
- automatically
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/196—Recognition using electronic means using sequential comparisons of the image signals with a plurality of references
- G06V30/1983—Syntactic or structural pattern recognition, e.g. symbolic string recognition
- G06V30/1985—Syntactic analysis, e.g. using a grammatical approach
Definitions
- the paper documents are scanned in a random, unorganized sequence, which makes it difficult and time-consuming to find a particular page within the electronic document.
- One solution can be to manually organize the paper documents prior to scanning; however, the individual organizing the paper documents or performing the scanning may not have the skill, knowledge or time needed to correctly organize the paper documents. Additionally, organizing the paper documents prior to scanning can be very time-consuming and expensive. Further, organizing the pages prior to scanning might properly order the pages, but it does not generate a table of contents, metadata, bookmarks or a hierarchical index that would facilitate finding a particular page within the complete set of pages.
- the recipient may want the pages of the electronic document organized in a specific order to facilitate finding a particular page in a timely and inexpensive manner.
- an assistant may scan forty pages of tax documents in a random order and save the result in an electronic document.
- an accountant may then need to organize the pages of the electronic document in a specific order so that navigating through the electronic document during the preparation and review of an income tax return can be performed in an accurate and efficient manner.
- One way that the recipient of an electronic document can organize the pages is by using the thumbnail, metadata and/or bookmark features of the document editing software program.
- Manually organizing an electronic document including typing a table of contents, metadata, bookmarks or a hierarchical index, is time-consuming and expensive.
- Manual organization tends to be ad-hoc, failing to deliver a standardized table of contents, metadata, bookmarks or a hierarchical index for the electronic document.
- Another way that the recipient of an electronic document can organize the pages is by using software that assists in manually categorizing document pages.
- the software provides a user a pre-identified set of types of documents and associates each page with the type selected by the user. This approach requires the recipient to manually categorize each page, a time-consuming and expensive process.
- a document analysis system that automatically classifies documents by recognizing in each document distinctive features that have been automatically learned by the system, so that the system may organize jobs according to the categories of documents the job contains.
- the document analysis system includes a document acquisition system, a document recognition training system, a document classification system, a document recognition system, and a job organization system.
- the document acquisition system receives jobs from a plurality of users, each job containing at least one electronic document having at least one page that includes image aspects and text.
- the document feature recognition system automatically extracts image and text features from each received electronic document.
- the document classification system automatically classifies recognized electronic documents as belonging to a corresponding category of document by finding the best match between the extracted features of the document and feature sets associated with each category of document, in which each feature set includes a set of image and text features and corresponding weights that distinguishes the respective category of document from the other categories of documents.
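The best-match step described above can be illustrated with a minimal sketch. It assumes that each category's feature set is a mapping from feature name to weight and that the match score is the sum of the weights of the features found on the page; the feature names, the weights and the scoring rule are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: category -> {feature name: weight}.
FeatureSet = Dict[str, float]


def score(extracted: Set[str], feature_set: FeatureSet) -> float:
    """Sum the weights of the category's features that appear on the page."""
    return sum(weight for name, weight in feature_set.items() if name in extracted)


def classify(extracted: Set[str], feature_sets: Dict[str, FeatureSet]) -> Tuple[str, float]:
    """Return the category with the highest score, together with that score."""
    best = max(feature_sets, key=lambda cat: score(extracted, feature_sets[cat]))
    return best, score(extracted, feature_sets[best])


# Illustrative data only; real feature sets would come from training.
feature_sets = {
    "W-2": {"text:wages": 2.0, "text:employer": 1.5, "image:box_grid": 1.0},
    "1099-INT": {"text:interest": 2.5, "text:payer": 1.0, "image:box_grid": 0.5},
}
page = {"text:interest", "text:payer", "image:box_grid"}
category, best_score = classify(page, feature_sets)
```

Under this assumed scoring rule, a shared image feature such as the box grid contributes to both categories, and the text features decide between them.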
- the document recognition training system automatically trains the feature set for each corresponding category of documents, the training system using extracted features of unrecognized electronic documents to automatically modify the feature set for a document category so that the ability of the document classification system to automatically classify documents improves as the training system is subjected to more and more unrecognized documents and the feature sets are modified accordingly.
- the job organization system automatically organizes each job according to the categories of documents it contains by organizing electronic documents associated with each job based on at least one business rule that corresponds to the categories of documents.
- FIG. 1 is a system diagram of a document identification and classification system 100 according to a preferred embodiment of the disclosed subject matter
- FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the disclosed subject matter
- FIG. 3 is an illustration of a three-step document submission process according to a preferred embodiment of the disclosed subject matter
- FIG. 4 is a system diagram of the server system 120 according to a preferred embodiment of the disclosed subject matter
- FIG. 5 is a system diagram of the production system 410 according to a preferred embodiment of the disclosed subject matter
- FIG. 6 is a system diagram of the training system 530 according to a preferred embodiment of the disclosed subject matter
- FIG. 7 is a system diagram of the image processing system 546 according to a preferred embodiment of the disclosed subject matter.
- FIG. 8 is an illustration of an example of nine types of point patterns according to a preferred embodiment of the disclosed subject matter
- FIG. 9 is a system diagram of the classification system 556 according to a preferred embodiment of the disclosed subject matter.
- FIG. 10 illustrates an organized document development system according to a preferred embodiment of the disclosed subject matter
- FIG. 11 illustrates an example of a bookmarked document according to a preferred embodiment of the disclosed subject matter
- FIG. 12 is a flow diagram of the service control manager 526 according to a preferred embodiment of the disclosed subject matter.
- FIG. 13 illustrates an exemplary computer system on which the described invention may run according to a preferred embodiment of the disclosed subject matter.
- Preferred embodiments of the present invention provide a method and system for converting paper and digital documents into well-organized electronic documents that are indexed, searchable and editable.
- the resulting organized electronic documents support more rapid and accurate data entry, retrieval and review than randomly sequenced sets of pages.
- FIG. 1 is a system diagram of a document identification and classification system 100 according to a preferred embodiment of the invention.
- System 100 has an image capture system 110 and a server system 120 .
- the image capture system is connected to the production servers by a network such as a local-area network (LAN), a wide-area network (WAN) or the Internet.
- the preferred implementation transfers all data over the network using Secure Sockets Layer (SSL) technology with enhanced 128-bit encryption.
- Encryption certificates can be purchased from well-respected certificate authorities such as VeriSign, or can be generated using numerous key generation tools on the market today, many of which are available as open source.
- the files may be transferred over a non-secure network, albeit in a less secure manner.
- System 110 is an image capture system that receives physical documents and scans them.
- the image capture system is described in greater detail below.
- System 120 is a server system that receives the scanned documents over the Internet. Once received, the server system organizes the classified pages per a predetermined scheme into a new, organized document.
- the server system includes a mechanism for learning documents. The server system is described in greater detail below.
- FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the invention.
- System 110 has a source document 210 , a user interface system 220 , a data transfer system 230 , a scanning system 215 , a data acquisition system 225 and an image processing system 235 .
- Source documents in the form of papers are physically placed on an input tray of a commercial scanner.
- Source documents in the form of data files are received over a network by the user interface system.
- the user interface system communicates with the data transfer system via software within a computer system.
- the user interface system communicates with the data acquisition system via software within a computer system.
- the data acquisition system communicates with the scanning system via a physical connection, such as a high-speed Universal Serial Bus (USB) 2.0, or, optionally, over a network.
- Element 210 is a source document in the form of either one or more physical papers or a digital file containing images of one or more papers.
- the digital file can be in one of many formats, such as PDF, TIFF, BMP, or JPEG.
- System 220 is a user interface system.
- the user interface system runs in a browser and presents a user with a three-step means for submitting documents to be organized as shown in FIG. 3 .
- the user interface system provides a mechanism for selecting a job from a list of jobs; additionally, it allows jobs to be added to the job list.
- the user interface system provides a mechanism for initiating the scanning of physical papers; additionally, it provides a browsing mechanism for selecting a file on a computer or network.
- one or more sets of papers can be scanned and one or more files can be selected.
- the user interface system provides a mechanism for sending the job information and selected documents over a network to the server system.
- the user interface system also presents a user with the status of jobs that have been submitted as submitted or completed; optionally, it presents the expected completion date and time of submitted jobs that have not been completed.
- the user interface system also presents a user with a mechanism for receiving submitted documents and organized documents.
- the user interface system also provides a mechanism for deleting files from the system.
- System 230 is a data transfer system.
- the data transfer system manages the SSL connection and associated data transfer with the server system.
- System 215 is a scanning system.
- conventional scanning systems may be used such as those from Bell+Howell, Canon, Fujitsu, Kodak, Panasonic and Xerox.
- the scanning system captures an image of the scanned document as a computer file; the file is often in a standard format such as PDF, TIFF, BMP, or JPEG.
- System 225 is a data acquisition system.
- the data acquisition system controls the settings of the scanning system.
- Many scanning systems in use today require users to manually set scanner settings so that images are captured, for example, at 300 dots per inch (dpi) as binary (black-and-white) data.
- Commercial scanners and scanning software often modify the original source document image, which may include high resolution and, possibly, color or gray-scale elements. The resolution is often reduced to limit file size.
- Color and gray-scale elements are often binarized, i.e. converted to black or white pixels, via a process known as thresholding, also to reduce file size.
- the data acquisition system sets the scanning system to scan pages double-sided at 300 dpi with eight bits of gray scale.
- the data acquisition system commands the scanning system to begin operation and receives the scanned document computer file from the scanning operation.
- System 235 is an optional image processing system.
- the image processing system enhances the image quality of scanned images for a given resolution and other scanner settings.
- the image processing system may be implemented as part of the image capture system as depicted on FIG. 2 or as part of the server system as depicted on FIG. 8 . Details of the image processing system are described in further detail below as part of the server system.
- FIG. 4 is a system diagram of the server system 120 according to a preferred embodiment of the invention.
- System 120 has a production system 410 , a system of databases 412 and a training system 414 .
- the production system is connected to the system of databases by a gigabit Ethernet connection.
- the training system is connected to the system of databases by a gigabit Ethernet connection.
- the production system classifies each of the pages in the document as one of a pre-identified set of types of documents.
- the production system organizes the classified pages per a predetermined scheme into a new, organized document.
- the production system stores the original scanned document and the organized document. The production system is described in greater detail below.
- the system of databases is comprised of a content repository, a job database, a class database and a training database.
- the system of databases is described in greater detail below.
- the training system utilizes supervised learning to provide a growing set of documents with characterized feature sets to the class database.
- the training system is described in greater detail below.
- FIG. 5 is a system diagram of the production system.
- System 410 has a web services system 510 , an application server 512 , a document management system and business entities (DMS) 514 , an LDAP authentication system 516 , a content repository 518 , a job database 528 , a service control manager 526 , an image processing system 546 , a classification system 556 , a success evaluation system 554 , an organized document development system 564 , a training database 532 , a training system 530 and a class database 558 .
- FIG. 5 shows the overall system that represents the server-side operation of the automatic organization of electronic documents.
- the system is comprised of several modules that help in automating and improving the accuracy of the results.
- the system may be built in a highly distributed architecture and consists of several independent processes, data repositories and databases which communicate and pass messages to each other via well-defined standard and proprietary interfaces. Even though the system may be built in a loosely coupled manner to achieve maximum scalability and throughput, the same results can be achieved if the system were more tightly coupled in a single process with each module being a logical entity of the same process. Furthermore, the design of the system considers multiple different product types which may need to process anywhere from hundreds to millions of documents every day for tens to thousands of customers in different markets.
- Another advantage of the above system design is that it allows the server(s) to be hosted in a secure data center. Documents from healthcare, insurance, banking, government, tax and other applications which go through the recognition and organization processing system will need security applied per policies that comply with HIPAA, GLBA, SAS 70 and similar requirements.
- the web services 510 system provides the server system connection to the network that interfaces with the image capture system.
- a network could be a local-area network (LAN), a wide-area network (WAN) or the Internet.
- Standard web services include Apache, RedHat JBoss Web Server, Microsoft IIS, Sun Java System Web Server, IBM Websphere, etc.
- the primary web service requirement is that the module should be able to handle multiple HTTP or HTTPS requests from different users as they upload their source electronic documents or download their organized electronic document, in a secure manner.
- the web service should also be able to relay any necessary parameters to the application servers which will process the electronic document.
- the application server 512 provides necessary clustering, caching, load-balancing and persistence of the application for a distributed deployment of large scalable enterprise applications.
- the application layer manages transaction context as documents are uploaded and downloaded to the system and maintains all necessary service integrity.
- the application server also provides messaging services, mail services, security services, and connection pool, all of which make the service available to handle a large number of requests simultaneously.
- the document management system (DMS) 514 and the business object layer capture the business entities.
- the DMS is generally a computer-based system or set of servers used to track and store electronic documents and/or images of paper documents.
- the DMS also commonly provides storage, versioning, metadata, security, as well as indexing and retrieval capabilities. Simple functions could include adding or retrieving an electronic document of a user.
- System 514 handles complex business hierarchies for a large number of users across multiple organizations. This is achieved by designing appropriate business objects that access the data access objects (DAO) with appropriate privileges and permissions.
- the data access objects implement the access mechanism to the data sources.
- the data source could be a persistent store like an RDBMS, an external service like a B2B exchange, a repository such as the LDAP database of System 516 , or a business service accessed via CORBA Internet Inter-ORB Protocol (IIOP) or low-level sockets.
- the business component that relies on the DAO uses the simpler interface exposed by the DAO for its clients.
- the DAO completely hides the data source implementation details from its clients. Because the interface exposed by the DAO to clients does not change when the underlying data source implementation changes, this pattern allows the DAO to adapt to different storage schemes without affecting its clients or business components. Essentially, the DAO acts as an adapter between the component and the data source.
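The adapter role of the DAO can be sketched as follows. This is an illustrative outline only: the names `DocumentDao`, `InMemoryDocumentDao` and `DocumentService` are hypothetical, and an in-memory dictionary stands in for whatever data source (RDBMS, LDAP repository, external service) an actual deployment would use.

```python
from typing import Dict, Optional, Protocol


class DocumentDao(Protocol):
    """Interface the business component sees; it says nothing about storage."""

    def save(self, doc_id: str, content: bytes) -> None: ...

    def load(self, doc_id: str) -> Optional[bytes]: ...


class InMemoryDocumentDao:
    """One possible data source; a database-backed DAO would expose the same methods."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def save(self, doc_id: str, content: bytes) -> None:
        self._store[doc_id] = content

    def load(self, doc_id: str) -> Optional[bytes]:
        return self._store.get(doc_id)


class DocumentService:
    """Business component that depends only on the DAO interface."""

    def __init__(self, dao: DocumentDao) -> None:
        self._dao = dao

    def add_document(self, doc_id: str, content: bytes) -> None:
        if not content:
            raise ValueError("empty document")
        self._dao.save(doc_id, content)

    def get_document(self, doc_id: str) -> Optional[bytes]:
        return self._dao.load(doc_id)


service = DocumentService(InMemoryDocumentDao())
service.add_document("job-1/page-1", b"%PDF-")
```

Replacing `InMemoryDocumentDao` with another implementation requires no change to `DocumentService`, which is the property the paragraph above attributes to the pattern.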
- the authentication system 516 allows secure and authorized access to the content repository.
- an LDAP authentication system is used; however, other authentication systems can also be used.
- an LDAP server is used to process queries and updates to an LDAP information directory. For example, a company could store all of the following very efficiently in an LDAP directory:
- document organization and access rights are managed by the access control privileges stored in the LDAP repository.
- the content repository 518 can be a simple file system, a relational database or an object-oriented database. Under a preferred embodiment, the content repository is based on Java Specification Request 170 (JSR 170). JSR 170 is a standard, implementation-independent way to access content bi-directionally on a granular level within a content repository.
- the content repository is a generic application “data store” that can be used for storing both text and binary data (images, word processor documents, PDFs, etc.).
- One key feature of a content repository is that one does not have to worry about how the data is actually stored: data could be stored in a relational database (RDBMS) or a file system or as an XML document.
- most content repositories provide advanced services such as uniform access control, searching, versioning, observation, locking, and more.
- documents in the content repository are available to the end user via a portal.
- the user can click on a web browser application button “View Source Document” in the portal and view the original scanned document over a secure network.
- the content repository can become an off-site secure storage facility for the user's electronic documents.
- the job database 528 is used to receive, then process and finally post the user's job back to the content repository.
- a “job” is defined as the steps of automatically organizing an electronic document from its original scanned images.
- Module 649 can be file system storage, a relational database, an XML document or a combination of these. In the current implementation, the system uses both file system storage to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
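The split between blob storage on the file system and pointers in a relational database might look like the sketch below. The table name, column names and file naming scheme are assumptions made for illustration; SQLite and a temporary directory stand in for the production database and file store.

```python
import sqlite3
import tempfile
from pathlib import Path


def store_page(conn: sqlite3.Connection, blob_dir: Path,
               job_id: str, page_no: int, data: bytes) -> Path:
    """Write the page image to the file system and record a pointer to it."""
    path = blob_dir / f"{job_id}_{page_no}.bin"
    path.write_bytes(data)
    conn.execute(
        "INSERT INTO pages (job_id, page_no, blob_path) VALUES (?, ?, ?)",
        (job_id, page_no, str(path)),
    )
    conn.commit()
    return path


def load_page(conn: sqlite3.Connection, job_id: str, page_no: int) -> bytes:
    """Follow the pointer stored in the database back to the blob."""
    row = conn.execute(
        "SELECT blob_path FROM pages WHERE job_id = ? AND page_no = ?",
        (job_id, page_no),
    ).fetchone()
    if row is None:
        raise KeyError((job_id, page_no))
    return Path(row[0]).read_bytes()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (job_id TEXT, page_no INTEGER, blob_path TEXT, "
    "PRIMARY KEY (job_id, page_no))"
)
blob_dir = Path(tempfile.mkdtemp())
store_page(conn, blob_dir, "job-1", 1, b"page image bytes")
```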
- the service control manager (SCM) 526 is a system that controls the state machine for each job.
- the state machine identifies the different states and the steps that a job has to progress through to achieve its final objective, in this case being an organized electronic document.
- the SCM is designed to be highly scalable and distributed.
- the SCM is multi-threaded to handle hundreds of jobs at any given time. It also implements message queues to communicate with other processes regarding their own states.
- the SCM can be implemented in other architectures as well. For example, one can implement a complete database driven approach to step through all the different steps required to process such a job.
- the SCM subscribes to events for each new incoming job that needs to be processed.
- the SCM pre-processes the job by taking the electronic document and separating each image (or page) into its own bitmap image for further processing. For example, if an electronic document has 30 pages, the system will create 30 images for processing. Each job in the system is given a unique identity. Furthermore, each page is given a unique page identity that is linked to the job identity. After the SCM has created image files by pre-processing the document into individual pages, it transitions the state of each page to image processing.
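A minimal sketch of this pre-processing step follows. The class names, the use of UUIDs for identities and the two state names are assumptions; the disclosure only requires that identities be unique, that pages be linked to their job, and that each page move to the image processing state.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PageState(Enum):
    CREATED = "created"                    # hypothetical initial state
    IMAGE_PROCESSING = "image_processing"


@dataclass
class Page:
    job_id: str            # links the page back to its job
    page_number: int
    image: bytes
    page_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: PageState = PageState.CREATED


@dataclass
class Job:
    page_images: List[bytes]
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def preprocess(job: Job) -> List[Page]:
    """Split a job into one Page per image, then hand each page to image processing."""
    pages = [Page(job.job_id, n, img)
             for n, img in enumerate(job.page_images, start=1)]
    for page in pages:
        page.state = PageState.IMAGE_PROCESSING
    return pages


job = Job(page_images=[b"img"] * 30)   # a 30-page document, as in the example above
pages = preprocess(job)
```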
- the image processing system 546 removes noise from the page image and properly orients the page so that document image analysis can be performed more accurately.
- the accuracy of the document recognition greatly depends on the quality of the image; thus image processing is included under preferred embodiments.
- the image processing system performs connected component analysis and, utilizing a line detection system, creates “confetti” images, which are small sections of the complete page image. Under preferred embodiments, the confetti images are accompanied by the coordinates of the image sub-section. The image processing system is discussed in greater detail below.
- the classification system 556 recognizes the page as one of a pre-identified set of types of documents.
- a major difficulty in categorizing a page as one of a large number of documents is the high dimensionality of the feature space.
- Conventional approaches that depend on text categorization alone are faced with a native feature space that consists of many unique terms (words as well as phrases) that occur in documents, which can be hundreds or thousands of terms for even a moderate-sized collection of unique documents.
- multiple systems that categorize income tax documents such as W-2, 1099-INT, K-1 and other forms have experienced poor accuracy because of the thousands of variations of tax documents.
- the preferred implementation uses a combination of image pattern recognition and text analysis to distinguish documents and machine learning technology to scale to large numbers of documents. The classification system is described in greater detail below.
- the class database 558 contains the trained set of information produced and used by the system's learning engine. As the system grows “smarter” by recognizing more classes and variations of documents, the class database grows. As the machine learning system sees more trained documents, its classification accuracy increases.
- the success evaluation system 554 determines how the document is treated once the classification process has been completed. If the classification system successfully classifies the document, the document is directed to System 564 , the organized document development system described below. If the classification system fails to recognize the document with a high level of confidence, the document is directed to System 530 , a training system, described below.
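The routing decision can be sketched as a single function. The confidence threshold of 0.9 and the destination labels are hypothetical; the disclosure states only that a "high level of confidence" is required for a document to proceed to System 564.

```python
from typing import Optional

CONFIDENCE_THRESHOLD = 0.9  # hypothetical value, not specified in the disclosure


def route(category: Optional[str], confidence: float,
          threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Send confidently classified pages on to organization, everything else to training."""
    if category is not None and confidence >= threshold:
        return "organized_document_development"  # System 564
    return "training"                            # System 530
```

For instance, `route("W-2", 0.97)` selects the organized document development system, while `route("W-2", 0.40)` or an unrecognized page (`category=None`) is sent to training.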
- the training system 530 performs computations on the data in its document database corresponding to the classification systems that are in place and generates datasets used by the classification system for recognizing source documents.
- the results of the training and re-training process are classification datasets that are updated in the class database. The training system is described in greater detail below.
- the system implements a continuous learning process in which a document that is not accurately identified is sent for training. Training results in an expanded data set in the class database 558 , thereby improving the accuracy of the system over time. As the class database grows, the system requires an asymptotically lower percentage of documents to be trained.
- FIG. 6 is a system diagram of the training system.
- System 530 has a manual class identification and correction system, a feature set computation system, a collision testing system, a training process manager, a classifier system, and a classification update system.
- Preferred implementations use machine learning supported by the training system that adapts to a growing set of documents. Additional documents add additional features that must be analyzed. Preferred implementations of the training system include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.
- the learning technique in the preferred implementation is supervised learning.
- Applications in which training data comprises examples of input vectors along with their corresponding target vectors are known as supervised learning problems.
- Example input vectors include key words and line patterns of the document layouts.
- Example target vectors include possible classes of output in the organized document.
- Supervised learning avoids the unstable states that can be reached by unsupervised learning and reinforcement learning systems.
- the learning system receives documents into a training database 614 from the success evaluation system. These documents are not trained and do not have corresponding classification model data in the class database. All such documents are made persistent in the training database.
- the training database 614 has several tables which contain the document class information as well as image processing information (which is discussed in greater detail below). The following tables are part of the training database:
- the training process manager 612 is a system that manages the distribution of the training task.
- a user called a “trainer” logs into the system in which the trainer has privileges at one of three levels.
- the training process manager directs document processing based on the document state:
- the form class state is changed to trained, not synched if allowed by policy.
- the document class has the following states:
- the manual identification and text correction system processes each document that requires training.
- the trainer follows two independent steps:
- the feature set computation system 620 combines the document image, the manually classified information and the corresponding text.
- the feature set produced by the feature set computation system is the point-pattern matching data, which is described in greater detail below.
- the classifier system 622 creates a global dictionary (GD) and global priority (GPr) of a word based on the words it receives from each document after OCR correction.
- GPr = MPr * log(total no. of classes / no. of classes in which the word appears)
- GPr is the global priority of the word and MPr is the mean priority of the word among the classes in which it is present. For example, if word 1 is present in class 1 with priority p1, in class 2 with priority p2, in class 3 with priority p3, then:
- a particular word can be determined to be a stop word or not.
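The global priority computation can be sketched as below. Two points are assumptions: MPr is taken as the arithmetic mean of the word's per-class priorities, and the logarithm is the natural logarithm, since the text does not state a base. The class names and priority values are illustrative only.

```python
import math
from typing import Dict


def global_priority(class_priorities: Dict[str, float], total_classes: int) -> float:
    """GPr = MPr * log(total classes / classes in which the word appears)."""
    if not class_priorities:
        raise ValueError("word does not appear in any class")
    # Assumption: MPr is the arithmetic mean of the per-class priorities.
    mpr = sum(class_priorities.values()) / len(class_priorities)
    return mpr * math.log(total_classes / len(class_priorities))


# A word present in 3 of 30 classes keeps a positive global priority.
gpr_rare = global_priority({"class1": 0.9, "class2": 0.6, "class3": 0.3},
                           total_classes=30)

# A word present in every class gets log(1) = 0, however high its mean priority.
gpr_everywhere = global_priority({f"class{i}": 0.5 for i in range(30)},
                                 total_classes=30)
```

Under these assumptions, a word that occurs in every class scores zero, which is one plausible signal for treating it as a stop word.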
- the preferred implementation stores all the words in a trie data structure to maximize search speed.
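As an illustration of why this structure gives fast lookups, the following is a minimal sketch of a dictionary that maps each stored word to a priority. Storing the priority on the final node is an assumption made for the example; the disclosure does not say what is kept at each node.

```python
from typing import Dict, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.priority: Optional[float] = None  # set only where a stored word ends


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str, priority: float) -> None:
        """Walk or create one node per letter, then mark the end of the word."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.priority = priority

    def lookup(self, word: str) -> Optional[float]:
        """Return the stored priority, or None if the word is absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.priority


dictionary = Trie()
dictionary.insert("wages", 1.2)
dictionary.insert("wage", 0.4)
```

Lookup cost depends on the length of the word rather than on the number of words stored.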
- a trellis of letter bigrams is built, which is used to predict letters during document OCR.
- the collision testing system 630 performs tests across a large set of trained documents to ensure that previously trained documents do not collide or break with the newly added information. Under preferred embodiments, a large regression suite is built as part of the collision testing system.
- the classification update system 632 inserts new trained data that passes regression testing into the class database.
- FIG. 7 is a system diagram of the image processing system.
- System 546 has a binarization system, a noise cleanup system, a skew correction system, an orientation correction system, a connected component analysis system, a text line detection system, a confetti generation system, a confetti storage system, a feature identification system and a feature storage system.
- Source document images can have salt-and-pepper noise, skew, orientation in any direction and/or color or gray-scale elements.
- a document can be captured as a color, grayscale or binary image by a scanning device. Common problems seen in images from scanning devices include:
- the preferred embodiment of the binarization system utilizes local thresholding, where the threshold value varies based on the local content in the document image.
- the preferred implementation is built on an adaptive thresholding technique which exploits local image contrast (reference: IEICE Electronics Express , Vol. 1, No. 16, pp. 501-506).
- the adaptive nature of this technique is based on flexible weights that are computed based on local mean and standard deviations calculated for the gray values in the primary local zone or window.
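To make the idea of a threshold driven by the local mean and standard deviation concrete, the sketch below uses a Sauvola-style formula, T = m * (1 + k * (s / R - 1)). This is a well-known stand-in chosen for illustration and is not the weighting scheme of the cited technique; the window size and the constants `k` and `r` are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def local_threshold_binarize(gray: List[List[int]], window: int = 3,
                             k: float = 0.2, r: float = 128.0) -> List[List[int]]:
    """Return 1 for ink (darker than the local threshold) and 0 for background."""
    h, w = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Gray values in the local window, clipped at the image border.
            vals = [gray[j][i]
                    for j in range(max(0, y - half), min(h, y + half + 1))
                    for i in range(max(0, x - half), min(w, x + half + 1))]
            m, s = mean(vals), pstdev(vals)
            threshold = m * (1 + k * (s / r - 1))  # Sauvola-style, assumed here
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out


# A light page (gray value 200) with a single dark pixel in the middle.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 20
binary = local_threshold_binarize(page)
```

Because the threshold is recomputed per pixel, the light pixels next to the dark one stay background even though the dark pixel lowers their local mean.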
- the preferred embodiment experimentally determines optimum median filters across a large set of document images for each application space.
- the noise cleanup system removes dots, specks and blobs from documents.
- minimum and maximum dot size are specified.
- the preferred embodiment also performs image reversal so that white text or line objects on black backgrounds are detected and inverted to black-on-white.
- the preferred embodiment also performs two noise removal techniques.
- the first technique starts with any small region of a binary image.
- the preferred implementation takes a 35×35 pixel region. In this region all background pixels are assigned value “0.” Pixels adjacent to background are given value “1.” A matrix is developed in this manner. In effect each pixel is given a value called the “distance transform” equal to its distance from the closest background pixel.
- the preferred implementation runs a smoothing technique on this distance transform. Smoothing is a process by which data points are averaged with their neighbors in a series. This typically has the effect of blurring the sharp edges in the smoothed data. Smoothing is sometimes referred to as filtering, because smoothing has the effect of suppressing high frequency signals and enhancing low frequency signals. Of the many different methods of smoothing, the preferred implementation uses a Gaussian kernel.
- the preferred implementation performs Gaussian smoothing on the distance transform with a filter using a variance of 0.5 and a 3×3 kernel or convolution mask. Thresholding with a threshold value of 0.85 is performed on the convolved image and the resulting data is converted to its binary space. This method has been tested across a large number of noisy documents and the results have been found to be good.
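- The first noise removal technique (distance transform, 3×3 Gaussian smoothing with variance 0.5, threshold 0.85) can be sketched as follows. The variance, kernel size and threshold come from the text; the 4-neighbour distance metric, the zero padding at the image border and all function names are assumptions.

```python
import math
from collections import deque


def distance_transform(binary):
    """4-neighbour distance of each foreground pixel (1) to the nearest background pixel (0)."""
    h, w = len(binary), len(binary[0])
    dist = [[0 if binary[y][x] == 0 else None for x in range(w)] for y in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w) if binary[y][x] == 0)
    while queue:  # breadth-first search outward from all background pixels
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    # Pixels never reached (an image with no background) get a large distance.
    return [[h + w if d is None else d for d in row] for row in dist]


def gaussian_kernel_3x3(variance=0.5):
    """Normalized 3x3 Gaussian convolution mask."""
    raw = [[math.exp(-(dx * dx + dy * dy) / (2 * variance)) for dx in (-1, 0, 1)]
           for dy in (-1, 0, 1)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def denoise(binary, variance=0.5, threshold=0.85):
    """Smooth the distance transform and threshold it back to a binary image."""
    h, w = len(binary), len(binary[0])
    dist = distance_transform(binary)
    kernel = gaussian_kernel_3x3(variance)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:  # outside the image counts as 0
                        acc += kernel[dy + 1][dx + 1] * dist[ny][nx]
            out[y][x] = 1 if acc > threshold else 0
    return out
```

- Under these assumptions an isolated single pixel smooths to roughly 0.33 and is removed, while the interior of a thicker stroke stays above 0.85 and survives.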
- the second technique uses connected component analysis (discussed in greater detail below) to identify small or bad blocks.
- a sliding mask is created of a known size.
- the preferred implementation uses a mask that is 35×35 pixels. This mask slides over the entire image and is used to detect the number of blobs (connected components) that are less than 10 pixels in size. If the number of blobs is greater than five, then all blobs are removed. This process is repeated by sliding the mask over the entire image.
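- The second noise removal technique can be sketched as below. The mask size (35), blob size limit (10 pixels) and blob count limit (five) come from the text; 8-connectivity, the sliding step, the rule that a blob must lie entirely inside the mask to be counted, and the function names are assumptions.

```python
from collections import deque


def label_components(binary):
    """Return a list of blobs; each blob is a list of (y, x) foreground pixels (8-connectivity)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:  # flood fill one connected component
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(pixels)
    return blobs


def remove_speckle(binary, mask=35, max_blob_pixels=10, max_blobs=5, step=1):
    """Clear small blobs wherever more than max_blobs of them fall inside one mask position."""
    h, w = len(binary), len(binary[0])
    small = [b for b in label_components(binary) if len(b) < max_blob_pixels]
    out = [row[:] for row in binary]
    for top in range(0, max(1, h - mask + 1), step):
        for left in range(0, max(1, w - mask + 1), step):
            inside = [
                b for b in small
                if all(top <= y < top + mask and left <= x < left + mask for y, x in b)
            ]
            if len(inside) > max_blobs:
                for blob in inside:
                    for y, x in blob:
                        out[y][x] = 0
    return out
```

- Counting blobs per mask position, rather than deleting every small blob, leaves sparse small marks such as periods or dots on the letter "i" in place and removes them only where they are dense enough to look like speckle.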
- the skew correction system fixes small angular rotations of the entire document image. Skew correction is important for the document analysis module because it improves text recognition, simplifies interpretation of page layout, improves baseline determination, and improves visual appearance of the final document.
- image processing libraries do skew correction.
- the preferred implementation of skew detection is part of the open source Leptonica image processing library.
- the orientation correction system aligns document images so that they can be most easily read.
- Documents originally in either portrait or landscape format may be rotated by 0, 90, 180 or 270 degrees during scanning. There are three preferred implementations of orientation correction.
- the first method detects blocks of text in the image and measures each with respect to its block height and width.
- the average width is more than average height.
- An average count of the width and height is performed and if the width to height ratio is above a certain threshold, the document is determined to be portrait or landscape.
- the second method performs a baseline analysis, counting the pixels in ascenders and descenders along any line in a document. Heuristically, the number of ascenders is found to be more than the number of descenders in English language documents that are correctly oriented. The document is oriented so that ascenders outnumber descenders.
- the third method performs OCR on small word or phrase images at all four orientations: 0, 90, 180 and 270 degrees. Small samples are selected from a document and the confidence is averaged across the sample. The orientation that has the highest confidence determines the correct orientation of the document.
- the connected component analysis system implements a very standard technique.
- the open source Image Processing Library 98 (IPL98) is used for connected component analysis.
- the text line detection system implements a technique described by Okun et al. (reference: “Robust Text Detection from Binarized Document Images”) to identify candidate text segments, i.e., blocks of consistent height. For a page from a book, this method may identify a whole line as a block, while for a form with many boxes this method will identify the text in each box.
- the confetti generation module identifies all the coordinates of the blocks.
- the confetti storage system stores the confetti information in the appropriate persistent storage.
- the job database stores confetti information for each job.
- the feature identification system looks for point and line features.
- the preferred implementation performs image layout analysis using two image properties, the point of intersection of lines and edge points, as shown in FIG. 8 , of text paragraphs. Every unique representation of points is referred as a unique class in the system and represents a unique point pattern in the system database.
- the preferred implementation uses a heuristically developed convolution method only on black pixels to perform a faster computation
- the system identifies nine types of points: four T's, four L's, and one cross (X) using nine masks.
- the preferred implementation of point pattern matching creates a string from the points detected in the image and then uses the Levenshtein distance to measure the gap between the trained set and the input image.
- the Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
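- The Levenshtein distance defined above can be computed with the standard dynamic-programming recurrence, sketched here. The encoding of point types as the letters L, T and X, the helper `closest_class` and the sample signatures are hypothetical and only illustrate how a point pattern string might be compared with trained strings.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free when characters match
            ))
        previous = current
    return previous[-1]


def closest_class(signature, trained_signatures):
    """Return (class_name, distance) for the trained signature nearest to the input."""
    return min(
        ((name, levenshtein(signature, trained))
         for name, trained in trained_signatures.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical point pattern strings for two trained form classes.
trained = {"W-2": "LTTLXTTL", "1099-INT": "LTLLTL"}
best = closest_class("LTTLXTL", trained)
```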
- the feature storage system saves the calculated features for each document.
- FIG. 9 is a system diagram of the classification system.
- System 556 has a custom lexicon set 907 , a feature retrieval system 923 , a confetti presentation system 925 , an OCR system 927 , a text retrieval system 929 , a key word identification system 941 , a stop word removal system 943 , a key word prioritization system 945 , a vector space creation system 947 , a ranking system 949 and a class identification system 965 .
- the preferred implementation performs classification using both image-level features and textual features.
- the key challenge in classification architecture is defining the classifier appropriate to the domain.
- Many forms and documents, such as business letters, tax forms, mortgage applications, health insurance forms, etc. have structural layouts and associated text, each of which have important domain information.
- the feature retrieval system receives confetti images and point pattern image features (discussed above) for a document from the job database and presents them to the OCR system.
- OCR optical character recognition
- Tesseract is open source software which allows custom modifications.
- a custom domain-specific lexicon set has been added to the system to improve the accuracy of the system.
- key words are collected that are made part of the custom lexicon set of that domain classifier.
- the OCR system converts each confetti image into text.
- the text retrieval system presents the text to the key word identification system.
- the key word identification system receives the confetti text and interfaces with the class database.
- the class database consists of the global dictionary and global priority words which are created by the training process and the point pattern signatures of all the trained forms.
- the stop word removal system removes stop words from the list of text that was received from the text retrieval system.
- Stop words are common words—for example: “a,” “the,” “it,” “not,” and, from the W-2 and 1099-INT example above, words including “omb,” “name,” “address,” etc.
- the stop words are provided by the class database and, in the preferred embodiment, are domain specific.
- the key word prioritization system calculates the priority of each word as a function of the line height (LnHt) of the word, partial or full match (PFM) with the form name and the total number of words in that form (N).
- LnHt line height
- PFM partial or full match
- N total number of words in that form
- Partial or Full Match increases the priority if the word partially or fully matches the form name.
- the calculation divides by the number of words in the form (N) to normalize the frequency if the form has a large number of words.
- the vector space creation system stores in a table the priority of each word in the form.
- a vector is described as (a1, a2, . . . ak), where a1, a2 . . . ak are the magnitudes in the respective dimensions. For example, for input words and corresponding line heights of a W-2 tax form, the following word-priority vectors are stored:
- the ranking system calculates the cosine distance of two vectors V1 and V2 as: cos θ = (V1 · V2) / (|V1| |V2|), where
- V1 · V2 is the dot product of the two vectors and |V1| and |V2| are their magnitudes.
- the class which has the maximum cosine distance with the form is the class to which the form should be classified, and is shown by module 965 .
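- The ranking step can be sketched as follows, treating each word-priority vector as a mapping from word to priority and using the usual cosine of the angle between two vectors, dot product divided by the product of magnitudes. The function names and the sample priorities are assumptions for illustration; the real priorities come from the key word prioritization system.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two sparse vectors given as {word: priority} dicts."""
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm1 * norm2)


def rank_classes(form_vector, class_vectors):
    """Return (class_name, score) pairs, best match first."""
    scores = {name: cosine(form_vector, vec) for name, vec in class_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up priorities for two trained classes and one incoming form.
class_vectors = {
    "W-2": {"wages": 0.9, "tax": 0.5},
    "1099-INT": {"interest": 0.9, "income": 0.4},
}
ranking = rank_classes({"wages": 0.8, "employer": 0.2}, class_vectors)
```

- The first entry of `ranking` is the class with the largest cosine value, which corresponds to the class selected in the text.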
- the class identification system performs point pattern matching based on the image features collected during image processing.
- the point pattern matching of documents is performed by creating a string from the points detected in the image and then using the Levenshtein distance to measure the gap between the trained set and the input image.
- the results of the ranking and the point pattern matching are used to determine the class matching values. If the system is not successful in finding a class match within a defined threshold, the document is marked as unclassified by the success evaluation module defined above.
- FIG. 10 is a system diagram of the organized document development system.
- System 564 has a business rules database 1010 , a business rules engine 1020 , a bookmark and tab library 1040 , a summary page system 1042 and a database update system 1044 .
- the business rules database stores rules that determine the ordering of documents for a given domain. For example, a simple business rule for organizing tax documents is to organize all wage related documents like W-2's first, followed by interest income documents, etc.
- the business rules engine identifies and orders the information in the jobs database.
- the business rules engine applies the rules in the business rules database to the documents in the jobs database.
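- A simple ordering rule of the kind described above (wage documents first, then interest income documents) can be sketched as a stable sort keyed on category rank. The document fields `page` and `category`, the function name `organize` and the choice to place unlisted categories last are assumptions for illustration.

```python
def organize(documents, category_order):
    """Order documents by the position of their category in category_order.

    Categories missing from the rule sort last. Python's sort is stable, so
    documents of the same category keep their original scan order.
    """
    rank = {category: i for i, category in enumerate(category_order)}
    return sorted(documents, key=lambda doc: rank.get(doc["category"], len(rank)))


# Hypothetical job: pages in scan order with their classified category.
job = [
    {"page": 1, "category": "1099-INT"},
    {"page": 2, "category": "unclassified"},
    {"page": 3, "category": "W-2"},
    {"page": 4, "category": "1099-INT"},
    {"page": 5, "category": "W-2"},
]
organized = organize(job, ["W-2", "1099-INT"])
```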
- FIG. 11 shows such an example of an automatically organized tax source document using the system.
- the bookmark and tag library creates an organized electronic document based on the outputs of the business rules engine.
- the summary page system creates a summary of key document data.
- the summary includes a table of contents, the date and time the document was processed, the name of the job, etc.
- the document is stored in the job database, the job is marked completed and an appropriate message is sent to the processing server to make the final finished document available to the end user through their portal.
- FIG. 12 is a system diagram of the service control manager.
- System 526 has a main thread 1201 , task queues 1202 , database client thread controllers 1203 , task queues 1204 , slave controllers 1205 and SCM queue 1206 .
- the main thread controls the primary state machine for all the jobs in the system.
- Task queues 1202 provide message queues for database communication.
- Database client thread controllers manage the database server interface.
- Task queues 1204 provide message queues for communication with slave controllers.
- Slave controllers manage various slave processes via the slave controller interface.
- the SCM queue provides a mechanism for the various controllers to communicate with the main thread.
- various threads communicate with each other using message queues. Whenever a new document is received for processing, the main thread is notified and it requests the DB client thread to retrieve the job for processing based on the states and the queue of other jobs in the system.
- a finite state machine for that job is created and the job starts to be processed.
- the main thread puts the job on a particular task queue based on the state machine instructions. For example, if the job needs to be image processed, then the job will be placed on the image processing task queue. If the slave controller for the image processing slave finds an idle image processing slave process, then the job is picked up from that queue and given to the slave process for processing. Once the slave finishes performing its assigned task, it returns the job to the slave controller which puts the job back on the SCM queue.
- the main thread sequentially picks up the job from the SCM queue and decides on the next state of the job based on the FSM states. Once a job is completed, the FSM for the job is closed and the organized document is returned to the repository and made available to the client's portal as a finished and processed document.
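- The hand-off between the main thread, the task queues, the slave controllers and the SCM queue can be sketched as a single-threaded simulation. The state names, the transition table `NEXT_STATE` and the function `run_jobs` are assumptions; the real system uses separate threads and processes, and its state machine is not spelled out in this text.

```python
from collections import deque

# Hypothetical per-job finite state machine: current state -> next state.
NEXT_STATE = {
    "received": "image_processing",
    "image_processing": "classification",
    "classification": "organization",
    "organization": "completed",
}


def run_jobs(job_ids):
    """Simulate jobs moving between the SCM queue and per-task queues."""
    scm_queue = deque((job, "received") for job in job_ids)
    task_queues = {state: deque()
                   for state in ("image_processing", "classification", "organization")}
    history = {job: [] for job in job_ids}
    completed = []
    while scm_queue or any(task_queues.values()):
        # Main thread: pick jobs off the SCM queue in order and decide the next state.
        while scm_queue:
            job, state = scm_queue.popleft()
            next_state = NEXT_STATE[state]
            if next_state == "completed":
                completed.append(job)
            else:
                task_queues[next_state].append(job)
        # Slave controllers: hand each queued job to a slave (assumed always idle),
        # then return the finished job to the SCM queue.
        for state, queue in task_queues.items():
            while queue:
                job = queue.popleft()
                history[job].append(state)  # stands in for the slave's actual work
                scm_queue.append((job, state))
    return completed, history
```

- Calling `run_jobs(["job-1", "job-2"])` walks both jobs through image processing, classification and organization before marking them completed.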
- FIG. 13 is a diagram that depicts the various components of a computerized document analysis system, according to certain embodiments of the invention.
- An exemplary document analysis system may include a host computer 1301 that contains volatile memory 1302, a persistent storage device such as a hard drive 1308, a processor 1303, and a network interface 1304. Using the network interface, the system computer can interact with databases 1305, 1306.
- Although FIG. 13 illustrates a system in which the system computer is separate from the various databases, some or all of the databases may be housed within the host computer, eliminating the need for a network interface.
- the programmatic processes may be executed on a single host, as shown in FIG. 13 , or they may be distributed across multiple hosts.
- the host computer shown in FIG. 13 may serve as a document analysis system.
- the host computer receives electronic documents from multiple users.
- Workstations may be connected to a graphical display device 1307 and to input devices such as a mouse 1309 and a keyboard 1310.
- the active user's work station may comprise a handheld device.
- the flow charts included in this application describe the logical steps that are embodied as computer executable instructions that could be stored in computer readable medium, such as various memories and disks, that, when executed by a processor, such as a server or server cluster, cause the processor to perform the logical steps.
- While text extraction and recognition may be performed using OCR or OCR-like techniques it is not limited to such. Other techniques could be used, including image recognition-like techniques.
- Organizing electronic documents is not limited to bookmark and/or folder approaches. It includes any ways in which it can be made easier to find and use the documents, such as document tagging.
- image features include inherent image features, e.g., lines, line crossings, etc. that are put in place by the document authors (or authors of an original source or blank document) to organize the document or the like. They were typically not included as a means of identifying the document, even though the inventors have discovered that they can be used as such, especially with the use of machine learning techniques.
Description
- This application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Patent Application Ser. No. 60/985,851, filed on Nov. 6, 2007, which is hereby incorporated by reference herein in its entirety.
- This application is related to the following applications filed concurrently herewith, the entire contents of which are incorporated by reference:
- U.S. patent application Ser. No. (TBA), entitled “Systems and Methods for Classifying Electronic Documents by Extracting and Recognizing Text and Image Features Indicative of Document Categories;”
- U.S. patent application Ser. No. (TBA), entitled “Systems and Methods for Training a Document Classification System Using Documents from a Plurality of Users;”
- U.S. patent application Ser. No. (TBA), entitled “Systems and Methods for Parallel Processing of Document Recognition and Classification Using Extracted Image and Text Features;”
- U.S. patent application Ser. No. (TBA), entitled “Systems and Methods for Handling and Distinguishing Binarized, Background Artifacts in the Vicinity of Document Text and Image Features Indicative of a Document Category;” and
- U.S. patent application Ser. No. (TBA), entitled “Systems and Methods for Enabling Manual Classification of Unrecognized Documents to Complete Workflow for Electronic Jobs and to Assist Machine Learning of a Recognition System Using Automatically Extracted Features of Unrecognized Documents.”
- Many software programs are currently available that allow a user to scan a number of paper documents and save them in a single electronic document. The electronic document is typically arranged as a sequence of individual pages. The software programs allow recipients to view, modify, print and store the electronic document. One example of such a document editing program is Adobe Acrobat from Adobe Systems Incorporated of San Jose, Calif.
- In many instances, however, the paper documents are scanned in a random, unorganized sequence, which makes it difficult and time-consuming to find a particular page within the electronic document. One solution can be to manually organize the paper documents prior to scanning; however, the individual organizing the paper documents or performing the scanning may not have the skill, knowledge or time needed to correctly organize the paper documents. Additionally, organizing the paper documents prior to scanning can be very time-consuming and expensive. Further, organizing the pages prior to scanning might properly order the pages, but it does not generate a table of contents, metadata, bookmarks or a hierarchical index that would facilitate finding a particular page within the complete set of pages.
- Ultimately, the recipient may want the pages of the electronic document organized in a specific order to facilitate finding a particular page in a timely and inexpensive manner. For example, an assistant may scan forty pages of tax documents in a random order and save the result in an electronic document. In this example, an accountant may then need to organize the pages of the electronic document in a specific order so that navigating through the electronic document during the preparation and review of an income tax return can be performed in an accurate and efficient manner.
- One way that the recipient of an electronic document can organize the pages is by using the thumbnail, metadata and/or bookmark features of the document editing software program. Manually organizing an electronic document, including typing a table of contents, metadata, bookmarks or a hierarchical index, is time-consuming and expensive. Manual organization tends to be ad-hoc, failing to deliver a standardized table of contents, metadata, bookmarks or a hierarchical index for the electronic document.
- Another way that the recipient of an electronic document can organize the pages is by using software that assists in manually categorizing document pages. The software provides a user a pre-identified set of types of documents and associates each page with the type selected by the user. This approach requires the recipient to manually categorize each page, a time-consuming and expensive process.
- Systems and methods to automatically organize electronic jobs by automatically classifying electronic documents using extracted image and text features and using a machine-learning recognition subsystem are provided. In some embodiments, a document analysis system that automatically classifies documents by recognizing in each document distinctive features that have been automatically learned by the system, so that the system may organize jobs according to the categories of documents the job contains, is provided. The document analysis system includes a document acquisition system, a document recognition training system, a document classification system, a document recognition system, and a job organization system. The document acquisition system receives jobs from a plurality of users, each job containing at least one electronic document having at least one page that includes image aspects and text. The document feature recognition system automatically extracts image and text features from each received electronic document. The document classification system automatically classifies recognized electronic documents as belonging to a corresponding category of document by finding the best match between the extracted features of the document and feature sets associated with each category of document, in which each feature set includes a set of image and text features and corresponding weights that distinguishes the respective category of document from the other categories of documents. 
The document recognition training system automatically trains the feature set for each corresponding category of documents, the training system using extracted features of unrecognized electronic documents to automatically modify the feature set for a document category so that the ability of the document classification system to automatically classify documents improves as the training system is subjected to more and more unrecognized documents and the feature sets are modified accordingly. And the job organization system automatically organizes each job according to the categories of documents it contains by organizing electronic documents associated with each job based on at least one business rule that corresponds to the categories of documents.
- The invention is illustrated in the figures of the accompanying drawings which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:
- FIG. 1 is a system diagram of a document identification and classification system 100 according to a preferred embodiment of the disclosed subject matter;
- FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the disclosed subject matter;
- FIG. 3 is an illustration of a three-step document submission process according to a preferred embodiment of the disclosed subject matter;
- FIG. 4 is a system diagram of the server system 120 according to a preferred embodiment of the disclosed subject matter;
- FIG. 5 is a system diagram of the production system 410 according to a preferred embodiment of the disclosed subject matter;
- FIG. 6 is a system diagram of the training system 530 according to a preferred embodiment of the disclosed subject matter;
- FIG. 7 is a system diagram of the image processing system 546 according to a preferred embodiment of the disclosed subject matter;
- FIG. 8 is an illustration of an example of nine types of point patterns according to a preferred embodiment of the disclosed subject matter;
- FIG. 9 is a system diagram of the classification system 556 according to a preferred embodiment of the disclosed subject matter;
- FIG. 10 illustrates an organized document development system according to a preferred embodiment of the disclosed subject matter;
- FIG. 11 illustrates an example of a bookmarked document according to a preferred embodiment of the disclosed subject matter;
- FIG. 12 is a flow diagram of the service control manager 526 according to a preferred embodiment of the disclosed subject matter; and
- FIG. 13 illustrates an exemplary computer system on which the described invention may run according to a preferred embodiment of the disclosed subject matter.
- While the prior art attempts to reduce the cost of electronic document organization through the use of software, none of the above methods of document organization (1) eliminates the human labor and accompanying requirements of education, domain expertise, training, and/or software knowledge, (2) minimizes time spent entering and quality checking page categorization, (3) minimizes errors and (4) protects the privacy of the owners of the data on the electronic documents being organized. What is needed, therefore, is a method of performing electronic document organization that overcomes the above-mentioned limitations and that includes the features enumerated above.
- Preferred embodiments of the present invention provide a method and system for converting paper and digital documents into well-organized electronic documents that are indexed, searchable and editable. The resulting organized electronic documents support more rapid and accurate data entry, retrieval and review than randomly sequenced sets of pages.
- FIG. 1 is a system diagram of a document identification and classification system 100 according to a preferred embodiment of the invention. System 100 has an image capture system 110 and a server system 120. The image capture system is connected to the production servers by a network such as a local-area network (LAN), a wide-area network (WAN) or the Internet. The preferred implementation transfers all data over the network using Secure Sockets Layer (SSL) technology with enhanced 128-bit encryption. Encryption certificates can be purchased from well-respected certificate authorities such as VeriSign and Thawte or can be generated by using numerous key generation tools in the market today, many of which are available as open source. Alternatively, the files may be transferred over a non-secure network, albeit in a less secure manner.
- Under typical operation, System 110 is an image capture system that receives physical documents and scans them. The image capture system is described in greater detail below.
- Under typical operation, System 120 is a server system that receives the scanned documents over the Internet. Once received, the server system organizes the classified pages per a predetermined scheme into a new, organized document. The server system includes a mechanism for learning documents. The server system is described in greater detail below.
- FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the invention. System 110 has a source document 210, a user interface system 220, a data transfer system 230, a scanning system 215, a data acquisition system 225 and an image processing system 235. Source documents in the form of papers are physically placed on an input tray of a commercial scanner. Source documents in the form of data files are received over a network by the user interface system. The user interface system communicates with the data transfer system via software within a computer system. The user interface system communicates with the data acquisition system via software within a computer system. The data acquisition system communicates with the scanning system via a physical connection, such as a high-speed Universal Serial Bus (USB) 2.0, or, optionally, over a network. The data acquisition system communicates with the image processing system via software within a computer system.
- Element 210 is a source document in the form of either one or more physical papers or a digital file containing images of one or more papers. The digital file can be in one of many formats, such as PDF, TIFF, BMP, or JPEG.
- System 220 is a user interface system. Under preferred embodiments, the user interface system runs in a browser and presents a user with a three-step means for submitting documents to be organized as shown in FIG. 3. In step one, the user interface system provides a mechanism for selecting a job from a list of jobs; additionally, it allows jobs to be added to the job list. In step two, the user interface system provides a mechanism for initiating the scanning of physical papers; additionally, it provides a browsing mechanism for selecting a file on a computer or network. Optionally, one or more sets of papers can be scanned and one or more files can be selected. In step three, the user interface system provides a mechanism for sending the job information and selected documents over a network to the server system. Under preferred embodiments, the user interface system also presents a user with the status of jobs that have been submitted as submitted or completed; optionally, it presents the expected completion date and time of submitted jobs that have not been completed. The user interface system also presents a user with a mechanism for receiving submitted documents and organized documents. The user interface system also provides a mechanism for deleting files from the system.
- System 230 is a data transfer system. Under preferred embodiments, the data transfer system manages the SSL connection and associated data transfer with the server system.
- System 215 is a scanning system. Under preferred embodiments, conventional scanning systems may be used such as those from Bell+Howell, Canon, Fujitsu, Kodak, Panasonic and Xerox. The scanning system captures an image of the scanned document as a computer file; the file is often in a standard format such as PDF, TIFF, BMP, or JPEG.
- System 225 is a data acquisition system. The data acquisition system controls the settings of the scanning system. Many scanning systems in use today require users to manually set scanner settings so that images are captured, for example, at 300 dots per inch (dpi) as binary data (black-and-white). Commercial scanners and scanning software modify the original source document images, which often include high resolution and, possibly, color or gray-scale elements. The resolution is often reduced to limit file size. Color and gray-scale elements are often binarized, e.g. converted to black or white pixels, via a process known as thresholding, also to reduce file size. Under preferred embodiments, the data acquisition system sets the scanning system to scan pages double-sided at 300 dpi with eight bits of gray scale. The data acquisition system commands the scanning system to begin operation and receives the scanned document computer file from the scanning operation.
- System 235 is an optional image processing system. The image processing system enhances the image quality of scanned images for a given resolution and other scanner settings. The image processing system may be implemented as part of the image capture system as depicted on FIG. 2 or as part of the server system as depicted on FIG. 8. Details of the image processing system are described in further detail below as part of the server system.
- FIG. 4 is a system diagram of the server system 120 according to a preferred embodiment of the invention. System 120 has a production system 410, a system of databases 412 and a training system 414. Under preferred embodiments, the production system is connected to the system of databases by a gigabit Ethernet connection. Under preferred embodiments, the training system is connected to the system of databases by a gigabit Ethernet connection.
- The production system classifies each of the pages in the document as one of a pre-identified set of types of documents. The production system organizes the classified pages per a predetermined scheme into a new, organized document. The production system stores the original scanned document and the organized document. The production system is described in greater detail below.
- The system of databases is comprised of a content repository, a job database, a class database and a training database. The system of databases is described in greater detail below.
- The training system utilizes supervised learning to provide a growing set of documents with characterized feature sets to the class database. The training system is described in greater detail below.
-
FIG. 5 is a system diagram of the production system. System 410 has a web services system 510, an application server 512, a document management system and business entities (DMS) 514, an LDAP authentication system 516, a content repository 518, a job database 528, a service control manager 526, an image processing system 546, a classification system 556, a success evaluation system 554, an organized document development system 564, a training database 532, a training system 530 and a class database 558. -
FIG. 5 shows the overall system that represents the server-side operation of the automatic organization of electronic documents. The system is comprised of several modules that help in automating and improving the accuracy of the results. The system may be built in a highly distributed architecture consisting of several independent processes, data repositories and databases that communicate and pass messages to each other via well-defined standard and proprietary interfaces. Even though the system may be built in a loosely coupled manner to achieve maximum scalability and throughput, the same results can be achieved if the system were more tightly coupled in a single process, with each module being a logical entity of the same process. Furthermore, the design of the system considers multiple different product types which may need to process anywhere from hundreds to millions of documents every day for tens to thousands of customers in different markets. Another advantage of the above system design is that it allows the server(s) to be hosted in a secure data center. Documents from healthcare, insurance, banking, government, tax and other applications that go through the recognition and organization processing system will need security applied per policies that are compliant with HIPAA, GLBA, SAS70, etc. - The
web services system 510 provides the server system's connection to the network that interfaces with the image capture system. Such a network could be a local-area network (LAN), a wide-area network (WAN) or the Internet. As described above, the preferred implementation transfers all data over the network using Secure Sockets Layer (SSL) technology with enhanced 128-bit encryption. Standard web services include Apache, RedHat JBoss Web Server, Microsoft IIS, Sun Java System Web Server, IBM Websphere, etc. The primary web service requirement is that the module be able to handle, in a secure manner, multiple HTTP or HTTPS requests from different users as they upload their source electronic documents or download their organized electronic documents. The web service should also be able to relay any necessary parameters to the application servers which will process the electronic document. - The
application server 512 provides the necessary clustering, caching, load-balancing and persistence of the application for a distributed deployment of large scalable enterprise applications. The application layer manages transaction context as documents are uploaded and downloaded to the system and maintains all necessary service integrity. The application server also provides messaging services, mail services, security services, and connection pooling, all of which make the service available to handle a large number of requests simultaneously. - The document management system (DMS) 514 and the business object layer capture the business entities. The DMS is generally a computer-based system or set of servers used to track and store electronic documents and/or images of paper documents. The DMS also commonly provides storage, versioning, metadata and security, as well as indexing and retrieval capabilities. Simple functions could include adding or retrieving an electronic document of a user.
System 514 handles complex business hierarchies for a large number of users across multiple organizations. This is achieved by designing appropriate business objects that access the data access objects (DAO) with appropriate privileges and permissions. The data access objects implement the access mechanism to the data sources. The data source could be a persistent store like an RDBMS, an external service like a B2B exchange, a repository such as the LDAP database of System 516, or a business service accessed via CORBA Internet Inter-ORB Protocol (IIOP) or low-level sockets. The business component that relies on the DAO uses the simpler interface exposed by the DAO for its clients. The DAO completely hides the data source implementation details from its clients. Because the interface exposed by the DAO to clients does not change when the underlying data source implementation changes, this pattern allows the DAO to adapt to different storage schemes without affecting its clients or business components. Essentially, the DAO acts as an adapter between the component and the data source. - The
authentication system 516 allows secure and authorized access to the content repository. Under preferred embodiments, an LDAP authentication system is used; however, other authentication systems can also be used. In general, an LDAP server is used to process queries and updates to an LDAP information directory. For example, a company could store all of the following very efficiently in an LDAP directory: -
- The company employee phone book and organizational chart
- External customer contact information
- Infrastructure services information, including NIS maps, email aliases, and so on
- Configuration information for distributed software packages
- Public certificates and security keys.
- Under a preferred embodiment, document organization and access rights are managed by the access control privileges stored in the LDAP repository.
- The
content repository 518 can be a simple file system, a relational database or an object oriented database. Under a preferred embodiment, the content repository is based on Java Specification Request 170 (JSR 170). JSR 170 is a standard implementation-independent way to access content bi-directionally on a granular level within a content repository. The content repository is a generic application “data store” that can be used for storing both text and binary data (images, word processor documents, PDFs, etc.). One key feature of a content repository is that one does not have to worry about how the data is actually stored: data could be stored in a relational database (RDBMS) or a file system or as an XML document. In addition to providing services for storing and retrieving the data, most content repositories provide advanced services such as uniform access control, searching, versioning, observation, locking, and more. - Under preferred embodiments, documents in the content repository are available to the end user via a portal. For example, in the current implementation of the system, the user can click on a web browser application button “View Source Document” in the portal and view the original scanned document over a secure network. Essentially, the content repository can become an off-site secure storage facility for the user's electronic documents.
- The
job database 528 is used to receive, then process and finally post the user's job back to the content repository. A “job” is defined as the steps of automatically organizing the electronic document from its original scanned images. Module 649 can be file system storage, a relational database, an XML document or a combination of these. In the current implementation, the system uses both file system storage to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. - The service control manager (SCM) 526 is a system that controls the state machine for each job. The state machine identifies the different states and the steps that a job has to progress through to achieve its final objective, in this case an organized electronic document. In the current system, the SCM is designed to be highly scalable and distributed. Under preferred embodiments, the SCM is multi-threaded to handle hundreds of jobs at any given time. It also implements message queues to communicate with other processes regarding their own states. The SCM can be implemented in other architectures as well. For example, one can implement a completely database-driven approach to step through all the different steps required to process such a job.
 - In preferred implementations the SCM subscribes to events for each new incoming job that needs to be processed. Once a new job arrives, the SCM pre-processes the job by taking the electronic document and separating each image (or page) into its own bitmap image for further processing. For example, if an electronic document has 30 pages, the system will create 30 images for processing. Each job in the system is given a unique identity. Furthermore, each page is given a unique page identity that is linked to the job identity. After the SCM has created image files by pre-processing the document into individual pages, it transitions the state of each page to image processing.
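The pre-processing step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SCM's actual design: the names `Job`, `Page`, `PageState` and `preprocess` are hypothetical, the document is assumed to arrive already split into per-page bitmaps, and the identity format (a random hexadecimal job identity with a page number suffix) is invented for the example.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class PageState(Enum):
    # Only the two states needed for this sketch; a real state machine has more.
    CREATED = "created"
    IMAGE_PROCESSING = "image processing"


@dataclass
class Page:
    job_id: str          # links the page back to its job
    page_id: str         # unique page identity
    image: bytes         # the page's bitmap
    state: PageState = PageState.CREATED


@dataclass
class Job:
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    pages: list = field(default_factory=list)


def preprocess(page_images):
    """Create one Page per bitmap and move each page to image processing."""
    job = Job()
    for number, image in enumerate(page_images, start=1):
        page = Page(job.job_id, f"{job.job_id}-{number:04d}", image)
        page.state = PageState.IMAGE_PROCESSING
        job.pages.append(page)
    return job


job = preprocess([b"bitmap"] * 30)   # a 30-page document yields 30 pages
```

Deriving the page identity from the job identity is one simple way to keep the link between the two; a separate foreign-key column in the job database would serve equally well.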
- The
image processing system 546 removes noise from the page image and properly orients the page so that document image analysis can be performed more accurately. The accuracy of the document recognition greatly depends on the quality of the image; thus image processing is included under preferred embodiments. The image processing system performs connected component analysis and, utilizing a line detection system, creates “confetti” images, which are small sections of the complete page image. Under preferred embodiments, the confetti images are accompanied by the coordinates of the image sub-section. The image processing system is discussed in greater detail below. - The
classification system 556 recognizes the page as one of a pre-identified set of types of documents. A major difficulty in categorizing a page as one of a large number of documents is the high dimensionality of the feature space. Conventional approaches that depend on text categorization alone are faced with a native feature space that consists of many unique terms (words as well as phrases) that occur in documents, which can be hundreds or thousands of terms for even a moderate-sized collection of unique documents. In one domain, multiple systems that categorize income tax documents such as W-2, 1099-INT, K-1 and other forms have experienced poor accuracy because of the thousands of variations of tax documents. The preferred implementation uses a combination of image pattern recognition and text analysis to distinguish documents and machine learning technology to scale to large numbers of documents. The classification system is described in greater detail below. - The
class database 558 contains the trained set of information produced and used by the system's learning engine. As the system grows “smarter” by recognizing more classes and variations of documents, the class database grows. As the machine learning system sees more trained documents, its classification accuracy increases. - The
success evaluation system 554 determines how the document is treated once the classification process has been completed. If the classification system successfully classifies the document, the document is directed to System 564, the organized document development system described below. If the classification system fails to recognize the document with a high level of confidence, the document is directed to System 530, a training system, described below. - The
training system 530 performs computations on the data in its document database corresponding to the classification systems that are in place and generates datasets used by the classification system for recognizing source documents. The results of the training and re-training process are classification datasets that are updated in the class database. The training system is described in greater detail below. - Thus, the system implements a continuous learning process in which a document that is not accurately identified is sent for training. Training results in an expanded data set in the
class database 558, thereby improving the accuracy of the system over time. As the class database grows, the system requires an asymptotically lower percentage of documents to be trained. -
FIG. 6 is a system diagram of the training system. System 530 has a manual class identification and correction system, a feature set computation system, a collision testing system, a training process manager, a classifier system, and a classification update system. - Preferred implementations use machine learning supported by the training system that adapts to a growing set of documents. Additional documents add additional features that must be analyzed. Preferred implementations of the training system include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.
- The learning technique in the preferred implementation is supervised learning. Applications in which training data comprises examples of input vectors along with their corresponding target vectors are known as supervised learning problems. Example input vectors include key words and line patterns of the document layouts. Example target vectors include possible classes of output in the organized document. Supervised learning avoids the unstable states that can be reached by unsupervised learning and reinforcement learning systems.
- The learning system receives documents into a
training database 614 from the success evaluation system. These documents are not trained and do not have corresponding classification model data in the class database. All such documents are made persistent in the training database. - The
training database 614 has several tables which contain the document class information as well as image processing information (which is discussed in greater detail below.) The following tables are part of training database: -
- Form class (classification view)
- Page table (details of the page of the electronic document)
- Manual classification table (manual work information)
- Manual training table (trainers' information)
- Confetti table (confetti information, original text, corrected text, etc.).
- The
training process manager 612 is a system that manages the distribution of the training task. Under preferred embodiments, a user, called a “trainer,” logs into the system in which the trainer has privileges at one of three levels. - Trainer levels:
-
- Top tier: add new classes to the system and perform classification and training
- Middle tier: perform manual classification and training
- Bottom tier: only perform training (manual text correction).
- The training process manager directs document processing based on the document state:
-
- Unclassified page is scheduled for manual classification
- Manual classification is done as per policy and form class is assigned
- Job database is updated with form class information and page/job states are changed so that the page can go to next state
- If the form class state is not trained, the form is scheduled for training, else no action is needed.
- After form training, the form class state is changed to trained, not synched if allowed by policy. The document class has the following states:
-
- Untrained
- Partially trained
- Trained, need synch with classification database
- Trained, synched with classification database.
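The four document class states and the transitions described above can be sketched as a small state machine. This is an illustrative sketch only: the names `FormClassState`, `needs_training`, `after_training` and `after_sync` are hypothetical, and treating "partially trained" as still requiring training is an assumption not stated in the text.

```python
from enum import Enum


class FormClassState(Enum):
    UNTRAINED = "untrained"
    PARTIALLY_TRAINED = "partially trained"
    TRAINED_NEEDS_SYNC = "trained, need synch with classification database"
    TRAINED_SYNCED = "trained, synched with classification database"


def needs_training(state):
    """Assumption: a partially trained class is still scheduled for training."""
    return state in (FormClassState.UNTRAINED, FormClassState.PARTIALLY_TRAINED)


def after_training(state, policy_allows=True):
    """After form training the class becomes 'trained, not synched' if policy allows."""
    if needs_training(state) and policy_allows:
        return FormClassState.TRAINED_NEEDS_SYNC
    return state


def after_sync(state):
    """Once the class database has been updated, the class is marked as synched."""
    if state is FormClassState.TRAINED_NEEDS_SYNC:
        return FormClassState.TRAINED_SYNCED
    return state
```

Keeping the transitions in small pure functions makes the policy check explicit and easy to test in isolation.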
- The manual identification and text correction system processes each document that requires training. The trainer follows two independent steps:
-
- Manually classifying the form and assigning a class and subclass
- Manually correcting OCR text (name required training for now).
- The manual identification and
text correction system 610 is comprised of a number of elements: - Receive pages from the training manager which manages the flow of pages between various trainers and implements training policy and restrictions
- Manual classification user interface (UI) which presents the page and asks the user to classify it
- Manual text correction UI which presents the page with marked up confetti. The user views the confetti and corrects the text extracted from the confetti
 - Training viewer UI, which is used to view the training database. The preferred implementation includes reports and representations of the training database
- Classification verification UI presents a page and its classification to a trainer.
- All user interfaces are integrated into a single system.
- The feature set
computation system 620 combines the document image, the manually classified information and the corresponding text. - The feature set computation system is the point-pattern matching data which is described in greater detail below.
- The
classifier system 622 creates a global dictionary (GD) and global priority (GPr) of a word based on the words it receives from each document after OCR correction. Global Priority is formulated as: -
GPr=MPr*log(total no. of classes/total no. of classes in which the word appears) - where GPr is the global priority of the word and MPr is the mean priority of the word among the classes in which it is present. For example, if word1 is present in class1 with priority p1, in class2 with priority p2, in class3 with priority p3, then:
-
MPr=(p1+p2+p3)/3 - If in total there are Nc classes, then:
-
GPr of word1=(p1+p2+p3)/3*log(Nc/3). - Thus, the inverse word frequency is taken as log(total no. of classes/total no. of classes in which the word appears), so that if a particular word is present in almost all classes, its priority will be low.
 - With the value of the priority, a particular word can be determined to be a stop word or not. The preferred implementation stores all the words in a trie data structure to maximize search speed. With all the words in the global dictionary, a trellis of bigrams of letters is built, which is used to predict letters during document OCR.
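A trie keyed on letters, with the priority stored at the node that ends each word, might look like the sketch below. The class name `Trie` and the helper `letter_bigrams` are hypothetical, and plain bigram counts are used here as a simplified stand-in for the trellis, whose exact structure the text does not specify.

```python
from collections import Counter


class Trie:
    """Letter trie; `priority` is set only on nodes that end a stored word."""

    def __init__(self):
        self.children = {}
        self.priority = None

    def insert(self, word, priority):
        node = self
        for letter in word:
            node = node.children.setdefault(letter, Trie())
        node.priority = priority

    def lookup(self, word):
        """Return the stored priority, or None if the word is not in the trie."""
        node = self
        for letter in word:
            node = node.children.get(letter)
            if node is None:
                return None
        return node.priority


def letter_bigrams(words):
    """Count adjacent letter pairs across all dictionary words."""
    counts = Counter()
    for word in words:
        counts.update(zip(word, word[1:]))
    return counts


dictionary = Trie()
dictionary.insert("wages", 0.693147)
dictionary.insert("name", 0.0)       # zero priority marks a stop word
bigrams = letter_bigrams(["wages", "name"])
```

Lookup cost grows with the length of the word rather than the size of the dictionary, which is the usual reason for choosing a trie here.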
 - Below is an example of two different classes (1099-INT and W-2) with their key words and priorities.
 - 1099-INT key words and priorities: payer 0.666667; name 0.333333; address 0.333333; account 0.333333; deferrals 0.333333; income 0.333333; rents 0.333333; royalties 0.333333; omb 0.666667; federal 0.333333; compensation 0.333333; tax 0.333333; parachute 0.333333; misc 1; substitute 0.333333; recipient 0.666667; int 1; foreign 0.666667; withdrawal 0.333333; rtn 0.666667; penalty 0.333333; investment 0.333333
 - W-2 key words and priorities: employer 0.666667; wages 1; employee 0.333333; social 0.333333; security 0.333333; dependent 0.333333; federal 0.333333; name 0.333333; address 0.333333; compensation 0.666667; tax 0.333333; omb 0.666667; income 0.333333
 - The global dictionary and global priorities for the above example are: payer 0.462098; name 0; address 0; account 0.231049; deferrals 0.231049; income 0; rents 0.231049; royalties 0.231049; omb 0; federal 0; compensation 0; tax 0; parachute 0.231049; misc 0.693147; substitute 0.231049; recipient 0.462098; int 0.693147; foreign 0.462098; withdrawal 0.231049; rtn 0.462098; penalty 0.231049; investment 0.231049; employer 0.462098; wages 0.693147; employee 0.231049; social 0.231049; security 0.231049; dependent 0.231049
 - The above table shows that words like “omb,” “name,” “address,” etc. have a priority of zero because they are present in all the classes; hence, they are stop words. Words that occur in more of the forms have lower priority.
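The global priority calculation can be checked with a short sketch. The function name `global_priorities` and the input layout are hypothetical; the natural logarithm is assumed because it reproduces the figures in the example (for instance, 0.666667 × ln 2 ≈ 0.462098). Only a few of the key words are included to keep the example short.

```python
import math


def global_priorities(class_priorities):
    """class_priorities maps class name -> {word: priority within that class}."""
    n_classes = len(class_priorities)
    by_word = {}
    for words in class_priorities.values():
        for word, priority in words.items():
            by_word.setdefault(word, []).append(priority)
    result = {}
    for word, priorities in by_word.items():
        mean_priority = sum(priorities) / len(priorities)          # MPr
        inverse_frequency = math.log(n_classes / len(priorities))  # log(Nc / classes containing word)
        result[word] = mean_priority * inverse_frequency           # GPr
    return result


classes = {
    "1099-INT": {"payer": 0.666667, "misc": 1.0, "name": 0.333333, "compensation": 0.333333},
    "W-2": {"employer": 0.666667, "wages": 1.0, "name": 0.333333, "compensation": 0.666667},
}
gpr = global_priorities(classes)

# Words present in every class get log(1) = 0 and are treated as stop words.
stop_words = sorted(word for word, value in gpr.items() if value == 0.0)
```

With only two classes, any word shared by both drops to zero regardless of its priority within each class, which is why "compensation" becomes a stop word here even though it carries a priority of 0.666667 on the W-2.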
- The
collision testing system 630 performs tests across a large set of trained documents to ensure that previously trained documents do not collide or break with the newly added information. Under preferred embodiments, a large regression suite is built as part of the collision testing system. - The
classification update system 632 inserts new trained data that passes regression testing into the class database. -
FIG. 7 is a system diagram of the image processing system. System 546 has a binarization system, a noise cleanup system, a skew correction system, an orientation correction system, a connected component analysis system, a text line detection system, a confetti generation system, a confetti storage system, a feature identification system and a feature storage system. - Source document images can have salt-and-pepper noise, skew, orientation in any direction and/or color or gray scale elements. A document can be captured as a color, grayscale or binary image by a scanning device. Common problems seen in images from scanning devices include:
-
- poor contrast due to lack of sufficient or controllable lighting
- non-uniform image background intensity due to uneven illumination
 - excessive random noise due to limited sensitivity of the sensors.
 - Many document images are rich in color and have complex backgrounds. Accurately processing such documents typically requires time-consuming processing and manual tuning of various parameters. Detecting text in such documents, which is necessary for text analysis, is difficult for typical text recognition systems that are optimized for binary images on clean backgrounds. For the classification system to work well, document images must be binarized and the text must be readable. Typically, general purpose scanners binarize images using global thresholding with a single threshold value, generally chosen from statistics of the whole image. Global thresholding is not well suited to images that suffer from common illumination or noise problems. It often results in characters that are broken, merged or degraded; further, thousands of connected components can be caused by binarization noise. Images degraded by global thresholding typically yield low quality pattern recognition and text analysis.
 - The preferred embodiment of the binarization system utilizes local thresholding, where the threshold value varies based on the local content in the document image. The preferred implementation is built on an adaptive thresholding technique which exploits local image contrast (reference: IEICE Electronics Express, Vol. 1, No. 16, pp. 501-506). The adaptive nature of this technique is based on flexible weights that are computed from the local mean and standard deviation of the gray values in the primary local zone or window. The preferred embodiment experimentally determines optimum median filters across a large set of document images for each application space.
- The noise cleanup system removes dots, specks and blobs from documents. In the preferred embodiment, minimum and maximum dot size are specified. The preferred embodiment also performs image reversal so that white text or line objects on black backgrounds are detected and inverted to black-on-white. The preferred embodiment also performs two noise removal techniques.
 - The first technique starts with any small region of a binary image. The preferred implementation takes a 35×35 pixel region. In this region all background pixels are assigned value “0.” Pixels adjacent to the background are given value “1.” A matrix is developed in this manner. In effect each pixel is given a value called the “distance transform” equal to its distance from the closest background pixel. The preferred implementation runs a smoothing technique on this distance transform. Smoothing is a process by which data points are averaged with their neighbors in a series. This typically has the effect of blurring the sharp edges in the smoothed data. Smoothing is sometimes referred to as filtering, because smoothing has the effect of suppressing high frequency signals and enhancing low frequency signals. Of the many different methods of smoothing, the preferred implementation uses a Gaussian kernel. In particular, the preferred implementation performs Gaussian smoothing on the distance transform with a filter using a variance of 0.5 and a 3×3 kernel or convolution mask. Thresholding with a threshold value of 0.85 is performed on the convolved images and the resulting data is converted to its binary space. This method has been tested across a large number of noisy documents and the results have been found to be good.
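The first technique can be sketched in plain Python as follows. This is a minimal sketch, not the preferred implementation: it assumes foreground pixels are 1 and background pixels are 0, measures distance in city-block steps, and pads the region with zeros during convolution. The function names are hypothetical; the variance of 0.5, the 3×3 kernel and the threshold of 0.85 come from the text.

```python
import math
from collections import deque


def distance_transform(img):
    """City-block distance from each pixel to the nearest background (0) pixel."""
    h, w = len(img), len(img[0])
    dist = [[0 if img[y][x] == 0 else None for x in range(w)] for y in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w) if img[y][x] == 0)
    while queue:  # breadth-first search outward from all background pixels
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    # A region with no background at all keeps a large placeholder distance.
    return [[d if d is not None else h + w for d in row] for row in dist]


def gaussian_kernel_3x3(variance=0.5):
    """Normalized 3x3 Gaussian weights for the given variance."""
    weights = [[math.exp(-(dx * dx + dy * dy) / (2.0 * variance))
                for dx in (-1, 0, 1)] for dy in (-1, 0, 1)]
    total = sum(map(sum, weights))
    return [[value / total for value in row] for row in weights]


def remove_noise(img, variance=0.5, threshold=0.85):
    """Smooth the distance transform and threshold it back to a binary image."""
    h, w = len(img), len(img[0])
    dist = distance_transform(img)
    kernel = gaussian_kernel_3x3(variance)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in (-1, 0, 1):
                for kx in (-1, 0, 1):
                    ny, nx = y + ky, x + kx
                    if 0 <= ny < h and 0 <= nx < w:  # zero padding outside the region
                        acc += kernel[ky + 1][kx + 1] * dist[ny][nx]
            out[y][x] = 1 if acc >= threshold else 0
    return out
```

Under these assumptions an isolated speck smooths to roughly a third of its original value and falls below the threshold, while the interior of a solid stroke stays well above it.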
- The second technique uses connected component analysis (discussed in greater detail below) to identify small or bad blocks. In this method a sliding mask is created of a known size. The preferred implementation uses a mask that is 35×35 pixels wide. This mask slides over the entire image and is used to detect the number of blobs (connected components) that are less than 10 pixels in size. If the number of blobs is greater than five, then all blobs are removed. This process is repeated by sliding the mask over the entire image.
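A sketch of the second technique follows. Several details are assumptions rather than statements from the text: the mask is moved in non-overlapping 35-pixel steps, components are 8-connected, and only the small blobs counted inside the mask are erased. The function names are hypothetical; the mask size, the 10-pixel blob limit and the count of five come from the text.

```python
from collections import deque


def label_components(img):
    """Label 8-connected foreground (1) components; returns (labels, sizes)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 1
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] != 1 or labels[sy][sx]:
                continue
            labels[sy][sx] = next_label
            queue = deque([(sy, sx)])
            count = 0
            while queue:
                y, x = queue.popleft()
                count += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            sizes[next_label] = count
            next_label += 1
    return labels, sizes


def remove_small_blobs(img, mask=35, max_blob_pixels=10, max_blobs=5):
    """Erase small blobs wherever a mask position contains more than max_blobs of them."""
    h, w = len(img), len(img[0])
    labels, sizes = label_components(img)
    out = [row[:] for row in img]
    for top in range(0, h, mask):
        for left in range(0, w, mask):
            rows = range(top, min(top + mask, h))
            cols = range(left, min(left + mask, w))
            small = {labels[y][x] for y in rows for x in cols
                     if labels[y][x] and sizes[labels[y][x]] < max_blob_pixels}
            if len(small) > max_blobs:
                for y in rows:
                    for x in cols:
                        if labels[y][x] in small:
                            out[y][x] = 0
    return out
```

Requiring more than five small blobs per mask position is what protects legitimate small marks such as periods and the dots on letters from being removed in otherwise clean areas.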
- The skew correction system fixes small angular rotations of the entire document image. Skew correction is important for the document analysis module because it improves text recognition, simplifies interpretation of page layout, improves baseline determination, and improves visual appearance of the final document. Several available image processing libraries do skew correction. The preferred implementation of skew detection is part of the open source Leptonica image processing library.
- The orientation correction system aligns document images so that they can be most easily read. Documents, originally in either portrait or landscape format may be rotated by 0, 90, 180 or 270 degrees during scanning. There are three preferred implementations of orientation correction.
 - The first method detects blocks of text in the image and measures the height and width of each block. In portrait documents, the average block width is greater than the average block height. The average width and height are computed, and the width-to-height ratio is compared against a threshold to determine whether the document is portrait or landscape.
- The second method performs a baseline analysis, counting the pixels in ascenders and descenders along any line in a document. Heuristically, the number of ascenders is found to be more than the number of descenders in English language documents that are correctly oriented. The document is oriented so that ascenders outnumber descenders.
 - The third method performs OCR on small word or phrase images at all four orientations: 0, 90, 180 and 270 degrees. Small samples are selected from a document and the OCR confidence is averaged across the sample. The orientation that has the highest confidence is taken as the correct orientation of the document.
 - The connected component analysis system implements a standard technique. In the preferred implementation the open source Image Processing Library 98 (IPL98) is used for connected component analysis.
 - The text line detection system implements a technique described by Okun et al. (reference: “Robust Text Detection from Binarized Document Images”) to identify candidate text segment blocks of consistent height. For a page from a book, this method may identify a whole line as a block, while for a form with many boxes this method will identify the text in each box.
- The confetti generation module identifies all the coordinates of the blocks.
- The confetti storage system stores the confetti information in the appropriate persistent storage.
- The job database stores confetti information for each job.
 - The feature identification system looks for point and line features. The preferred implementation performs image layout analysis using two image properties of text paragraphs, the points of intersection of lines and the edge points, as shown in FIG. 8. Every unique representation of points is referred to as a unique class in the system and represents a unique point pattern in the system database. The preferred implementation uses a heuristically developed convolution method only on black pixels to perform a faster computation. The system identifies nine types of points: four T's, four L's, and one cross (X) using nine masks. - The preferred implementation of point pattern matching is performed by creating a string from the points detected in the image and then using the Levenshtein distance to measure the gap between the trained set and the input image. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
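The Levenshtein distance can be computed with the standard dynamic programming recurrence, sketched below. How the detected points are encoded as a string is not specified in the text, so the one-letter-per-point encoding in the usage lines ("T", "L", "X") and the helper `closest_class` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))       # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                        # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def closest_class(point_string, trained):
    """Return the trained class whose point string is nearest to the input."""
    return min(trained, key=lambda name: levenshtein(point_string, trained[name]))


trained_patterns = {"W-2": "TLXL", "1099-INT": "LLLLXX"}
best = closest_class("TLXT", trained_patterns)
```

Keeping only the previous row of the table holds memory proportional to the length of one string, which matters little for short point strings but costs nothing.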
- The feature storage system saves the calculated features for each document.
-
FIG. 9 is a system diagram of the classification system. System 556 has a custom lexicon set 907, a feature retrieval system 923, a confetti presentation system 925, an OCR system 927, a text retrieval system 929, a key word identification system 941, a stop word removal system 943, a key word prioritization system 945, a vector space creation system 947, a ranking system 949 and a class identification system 965. - The preferred implementation performs classification using both image-level features and textual features. The key challenge in classification architecture is defining the classifier appropriate to the domain. Many forms and documents, such as business letters, tax forms, mortgage applications, health insurance forms, etc., have structural layouts and associated text, each of which carries important domain information.
- The feature retrieval system receives confetti images and point pattern image features (discussed above) for a document from the job database and presents them to the OCR system. Several optical character recognition (OCR) software programs are available in the market today. The preferred implementation uses Tesseract, an open source software which allows custom modifications. In the preferred implementation, a custom domain-specific lexicon set has been added to the system to improve the accuracy of the system. During training of the document for a particular domain, several key words are collected that are made part of the custom lexicon set of that domain classifier.
- The OCR system converts each confetti image into text.
- The text retrieval system presents the text to the key word identification system.
- The key word identification system receives the confetti text and interfaces with the class database. The class database consists of the global dictionary and global priority words which are created by the training process and the point pattern signatures of all the trained forms.
- Under the preferred embodiment, the stop word removal system removes stop words from the list of text that was received from the text retrieval system. Stop words are common words—for example: “a,” “the,” “it,” “not,” and, from the W-2 and 1099-INT example above, words including “omb,” “name,” “address,” etc. The stop words are provided by the class database and, in the preferred embodiment, are domain specific.
 - The key word prioritization system, in the preferred implementation, calculates the priority of each word as a function of the line height (LnHt) of the word, its partial or full match (PFM) with the form name and the total number of words in that form (N). The approximate value of priority is formulated as
-
Pr=(ΣLnHt*PFM)/N. - The summation is taken to give more priority to a word whose frequency is higher in a particular form. Partial or Full Match (PFM) increases the priority if the word partially or fully matches the form name. The calculation divides by the number of words in the form (N) to normalize the frequency if the form has a large number of words.
 - The vector space creation system stores in a table the priority of each word in the form. A vector is described as (a1, a2, . . . ak), where a1, a2 . . . ak are the magnitudes in the respective dimensions. For example, for input words and corresponding line heights of a W-2 tax form, the following word-priority vectors are stored:
-
omb 10 employer 5 employer 5 wages 5 compensation 5 compensation 5 dependent 5 wages 10 social 5 security 5 income 5 tax 5 federal 5 name 5 address 5
The normalized valued for the priorities are: -
omb 0.666667 employer 0.666667 wages 1 compensation 0.666667 dependent 0.333333 social 0.333333 security 0.333333 income 0.333333 tax 0.333333 federal 0.333333 name 0.333333 address 0.333333. - In such a vector space, the words with larger font size or higher frequency will have higher priority.
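The W-2 numbers above can be reproduced with a short sketch. Several points are assumptions rather than statements from the text: PFM is taken as 1 for every word (the source does not give its values), N is taken as the number of word occurrences, and the normalization is assumed to divide by the largest priority, which is what the example values imply. The function names are hypothetical.

```python
from collections import defaultdict


def raw_priorities(observations, pfm=lambda word: 1.0):
    """Pr = (sum of LnHt * PFM over a word's occurrences) / N."""
    n = len(observations)  # N: total number of words in the form (assumed: occurrences)
    totals = defaultdict(float)
    for word, line_height in observations:
        totals[word] += line_height * pfm(word)
    return {word: total / n for word, total in totals.items()}


def normalize(priorities):
    """Scale so the highest-priority word has value 1 (assumed from the example)."""
    top = max(priorities.values())
    return {word: value / top for word, value in priorities.items()}


# (word, line height) pairs from the W-2 example.
W2_OBSERVATIONS = [
    ("omb", 10), ("employer", 5), ("employer", 5), ("wages", 5),
    ("compensation", 5), ("compensation", 5), ("dependent", 5), ("wages", 10),
    ("social", 5), ("security", 5), ("income", 5), ("tax", 5),
    ("federal", 5), ("name", 5), ("address", 5),
]

normalized = normalize(raw_priorities(W2_OBSERVATIONS))
for word, value in normalized.items():
    print(f"{word} {value:.6f}")
```

Under these assumptions "wages" (line heights 5 and 10) receives the top priority, and words seen once at line height 5 receive one third of it.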
- The ranking system calculates the cosine distance of two vectors V1 and V2 as:
-
cos θ=(V1.V2)/(|V1|*|V2|)
- where V1.V2 is the dot product of the two vectors and |V| represents the magnitude of a vector. When the cosine distance nears 0, the vectors are orthogonal (dissimilar); when it nears 1, the vectors point in the same direction and are similar.
- The class which has the maximum cosine distance with the form is the class to which the form should be classified, and is shown by module 965.
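The ranking step can be illustrated with a small sketch that computes the cosine measure between sparse word-priority vectors and picks the class with the largest value. The class vectors, the sample form vector, and the function names are hypothetical; the source does not specify how vectors are stored.

```python
import math


def cosine(v1, v2):
    """cos(theta) = (V1 . V2) / (|V1| * |V2|) for sparse word -> priority dicts."""
    dot = sum(value * v2.get(word, 0.0) for word, value in v1.items())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm1 * norm2)


def best_class(form_vector, class_vectors):
    """Return the class name whose vector has the largest cosine with the form."""
    return max(class_vectors, key=lambda name: cosine(form_vector, class_vectors[name]))


# Hypothetical trained class vectors.
CLASS_VECTORS = {
    "W-2": {"wages": 1.0, "employer": 0.67, "compensation": 0.67},
    "1099-INT": {"interest": 1.0, "payer": 0.67, "income": 0.33},
}

form = {"wages": 1.0, "employer": 0.5, "tax": 0.3}
print(best_class(form, CLASS_VECTORS))
```

Words missing from one vector contribute zero to the dot product, so two forms with no words in common score 0.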
- The class identification system performs point pattern matching based on the image features collected during image processing. As mentioned earlier, the point pattern matching of documents is performed by creating a string from the points detected in the image and then using Levenshtein distance to measure the gap between the trained set and the input image.
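As a rough sketch of the string comparison, the following computes the Levenshtein distance with the standard dynamic-programming recurrence and selects the trained signature nearest to the input. How the signature string is built from detected points is not shown here; the signature strings and the name `closest_signature` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_signature(input_signature, trained_signatures):
    """Return the class whose trained signature is nearest to the input signature."""
    return min(trained_signatures,
               key=lambda name: levenshtein(input_signature, trained_signatures[name]))


# Hypothetical point pattern signatures encoded as strings.
TRAINED = {"W-2": "LTTXLTXL", "1099-INT": "TTLLXXTT"}
print(closest_signature("LTTXLTLL", TRAINED))
```

A smaller distance means a closer match, so this selection uses `min`, in contrast to the cosine ranking, which uses the maximum.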
- In the preferred embodiment, the results of the ranking and the point pattern matching are used to determine the class matching values. If the system is not successful in finding a class match within a defined threshold, the document is marked as unclassified by the success evaluation module defined above.
- FIG. 10 is a system diagram of the organized document development system. System 564 has a business rules database 1010, a business rules engine 1020, a bookmark and tab library 1040, a summary page system 1042 and a database update system 1044.
- The business rules database stores rules that determine the ordering of documents for a given domain. For example, a simple business rule for organizing tax documents is to organize all wage related documents like W-2's first, followed by interest income documents, etc.
- The business rules engine identifies and orders the information in the jobs database. The business rules engine applies the rules in the business rules database to the documents in the jobs database.
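A simple ordering rule of this kind might look like the sketch below. The category list, the document dictionaries, and the function name `order_documents` are hypothetical; the source does not describe how rules are represented in the business rules database.

```python
# Hypothetical rule for the tax domain: wage documents first, then interest income.
TAX_CATEGORY_ORDER = ["W-2", "1099-INT", "1099-DIV"]


def order_documents(documents, category_order):
    """Sort documents by rule order; categories not in the rule go last.

    sorted() is stable, so documents in the same category keep their scan order.
    """
    rank = {category: index for index, category in enumerate(category_order)}
    return sorted(documents, key=lambda doc: rank.get(doc["category"], len(rank)))


job = [
    {"page": 1, "category": "1099-INT"},
    {"page": 2, "category": "W-2"},
    {"page": 3, "category": "unclassified"},
    {"page": 4, "category": "W-2"},
]
ordered = order_documents(job, TAX_CATEGORY_ORDER)
print([doc["page"] for doc in ordered])
```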
- FIG. 11 shows an example of a tax source document automatically organized using the system.
- The bookmark and tag library creates an organized electronic document based on the outputs of the business rules engine.
- The summary page system creates a summary of key document data. In the preferred implementation, the summary includes a table of contents, the date and time the document was processed, the name of the job, etc.
- In the preferred implementation, after the document is fully organized, the document is stored in the job database, the job is marked completed and an appropriate message is sent to the processing server to make the final finished document available to the end user through their portal.
- FIG. 12 is a system diagram of the service control manager. System 526 has a main thread 1201, task queues 1202, database client thread controllers 1203, task queues 1204, slave controllers 1205 and SCM queue 1206.
- The main thread controls the primary state machine for all the jobs in the system.
- Task queues 1202 provide message queues for database communication.
- Database client thread controllers manage the database server interface.
- Task queues 1204 provide message queues for communication with slave controllers.
- Slave controllers manage various slave processes via the slave controller interface.
- The SCM queue provides a mechanism for the various controllers to communicate with the main thread.
- In the preferred implementation, the various threads communicate with each other using message queues. Whenever a new document is received for processing, the main thread is notified and it requests the DB client thread to retrieve the job for processing based on the states and the queue of other jobs in the system.
- In the preferred implementation, once the job is loaded in memory, a finite state machine for that job is created and the job starts to be processed. The main thread puts the job on a particular task queue based on the state machine instructions. For example, if the job needs to be image processed, then the job will be placed on the image processing task queue. If the slave controller for the image processing slave finds an idle image processing slave process, then the job is picked up from that queue and given to the slave process for processing. Once the slave finishes performing its assigned task, it returns the job to the slave controller which puts the job back on the SCM queue. The main thread sequentially picks up the job from the SCM queue and decides on the next state of the job based on the FSM states. Once a job is completed, the FSM for the job is closed and the organized document is returned to the repository and made available to the client's portal as a finished and processed document.
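The job flow described above can be approximated with a single-threaded simulation. This is only a sketch under stated assumptions: the state names and the `NEXT_STATE` table are hypothetical, real slave processes and threads are replaced by plain loops, and database access is omitted.

```python
from collections import deque

# Hypothetical finite state machine: each state maps to the state that follows it.
NEXT_STATE = {
    "received": "image_processing",
    "image_processing": "classification",
    "classification": "organization",
    "organization": "completed",
}


def run_jobs(job_ids):
    """Move every job through the FSM using an SCM queue and per-task queues."""
    scm_queue = deque((job_id, "received") for job_id in job_ids)
    task_queues = {state: deque() for state in NEXT_STATE.values() if state != "completed"}
    history = {job_id: ["received"] for job_id in job_ids}

    while scm_queue or any(task_queues.values()):
        # Main thread: take jobs off the SCM queue and decide their next state.
        while scm_queue:
            job_id, state = scm_queue.popleft()
            next_state = NEXT_STATE[state]
            history[job_id].append(next_state)
            if next_state != "completed":
                task_queues[next_state].append(job_id)
        # Slave controllers: hand queued jobs to a slave, then return them to the SCM queue.
        for state, queue in task_queues.items():
            while queue:
                job_id = queue.popleft()
                # A real slave process would perform the work for `state` here.
                scm_queue.append((job_id, state))
    return history


history = run_jobs(["job-1", "job-2"])
print(history["job-1"])
```

Routing every finished task back through one SCM queue keeps all state transitions in a single place, which is the role the text assigns to the main thread.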
- FIG. 13 is a diagram that depicts the various components of a computerized document analysis system, according to certain embodiments of the invention. An exemplary document analysis system may include a host computer 1301 that contains volatile memory 1302, a persistent storage device such as a hard drive 1308, a processor 1303, and a network interface 1304. Using the network interface, the system computer can interact with databases 1305, 1306. Although FIG. 13 illustrates a system in which the system computer is separate from the various databases, some or all of the databases may be housed within the host computer, eliminating the need for a network interface. The programmatic processes may be executed on a single host, as shown in FIG. 13, or they may be distributed across multiple hosts.
- The host computer shown in FIG. 13 may serve as a document analysis system. The host computer receives electronic documents from multiple users. Workstations may be connected to a graphical display device 1307 and to input devices such as a mouse 1309 and a keyboard 1310. Alternately, the active user's work station may comprise a handheld device.
- In some embodiments, the flow charts included in this application describe the logical steps that are embodied as computer executable instructions that could be stored in a computer readable medium, such as various memories and disks, that, when executed by a processor, such as a server or server cluster, cause the processor to perform the logical steps.
- While text extraction and recognition may be performed using OCR or OCR-like techniques, they are not limited to such. Other techniques could be used, including image recognition-like techniques.
- Organizing electronic documents is not limited to bookmark and/or folder approaches. It includes any ways in which it can be made easier to find and use the documents, such as document tagging.
- As described above, preferred embodiments extract image features from a document and use them to assist in classifying the document category. These image features include inherent image features, e.g., lines, line crossings, etc., that are put in place by the document authors (or authors of an original source or blank document) to organize the document or the like. They were typically not included as a means of identifying the document, even though the inventors have discovered that they can be used as such, especially with the use of machine learning techniques.
- While many applications can benefit from extracting both image and text features so that the extracted features may be used to classify documents, for some applications, image features alone may suffice. Specifically, some problem domains may have document categories where the inherent image features are sufficiently distinctive to classify a document with high enough confidence (even without processing text features).
- Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention. Features of the disclosed embodiments can be combined and rearranged in various ways.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/266,462 US20090116736A1 (en) | 2007-11-06 | 2008-11-06 | Systems and methods to automatically classify electronic documents using extracted image and text features and using a machine learning subsystem |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US98585107P | 2007-11-06 | 2007-11-06 | |
US12/266,462 US20090116736A1 (en) | 2007-11-06 | 2008-11-06 | Systems and methods to automatically classify electronic documents using extracted image and text features and using a machine learning subsystem |
Publications (1)
Publication Number | Publication Date |
---|---|
US20090116736A1 (en) | 2009-05-07 |
Family
ID=40588156
Family Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/266,472 Abandoned US20090116757A1 (en) | 2007-11-06 | 2008-11-06 | Systems and methods for classifying electronic documents by extracting and recognizing text and image features indicative of document categories |
US12/266,469 Abandoned US20090116756A1 (en) | 2007-11-06 | 2008-11-06 | Systems and methods for training a document classification system using documents from a plurality of users |
US12/266,468 Abandoned US20090116746A1 (en) | 2007-11-06 | 2008-11-06 | Systems and methods for parallel processing of document recognition and classification using extracted image and text features |
US12/266,454 Abandoned US20090116755A1 (en) | 2007-11-06 | 2008-11-06 | Systems and methods for enabling manual classification of unrecognized documents to complete workflow for electronic jobs and to assist machine learning of a recognition system using automatically extracted features of unrecognized documents |
US12/266,462 Abandoned US20090116736A1 (en) | 2007-11-06 | 2008-11-06 | Systems and methods to automatically classify electronic documents using extracted image and text features and using a machine learning subsystem |
US12/266,465 Active 2029-05-25 US8538184B2 (en) | 2007-11-06 | 2008-11-06 | Systems and methods for handling and distinguishing binarized, background artifacts in the vicinity of document text and image features indicative of a document category |
Country Status (2)
Country | Link |
---|---|
US (6) | US20090116757A1 (en) |
WO (1) | WO2009061917A1 (en) |
2008
- 2008-11-06 WO PCT/US2008/082631 patent/WO2009061917A1/en active Application Filing
- 2008-11-06 US US12/266,472 patent/US20090116757A1/en not_active Abandoned
- 2008-11-06 US US12/266,469 patent/US20090116756A1/en not_active Abandoned
- 2008-11-06 US US12/266,468 patent/US20090116746A1/en not_active Abandoned
- 2008-11-06 US US12/266,454 patent/US20090116755A1/en not_active Abandoned
- 2008-11-06 US US12/266,462 patent/US20090116736A1/en not_active Abandoned
- 2008-11-06 US US12/266,465 patent/US8538184B2/en active Active
Patent Citations (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5680478A (en) * | 1992-04-24 | 1997-10-21 | Canon Kabushiki Kaisha | Method and apparatus for character recognition |
US5778103A (en) * | 1992-10-19 | 1998-07-07 | Tmssequoia | OCR image pre-processor |
US5642288A (en) * | 1994-11-10 | 1997-06-24 | Documagix, Incorporated | Intelligent document recognition and handling |
US5995665A (en) * | 1995-05-31 | 1999-11-30 | Canon Kabushiki Kaisha | Image processing apparatus and method |
US6101515A (en) * | 1996-05-31 | 2000-08-08 | Oracle Corporation | Learning system for classification of terminology |
US5943669A (en) * | 1996-11-25 | 1999-08-24 | Fuji Xerox Co., Ltd. | Document retrieval device |
US6094653A (en) * | 1996-12-25 | 2000-07-25 | Nec Corporation | Document classification method and apparatus therefor |
US6006226A (en) * | 1997-09-24 | 1999-12-21 | Ricoh Company Limited | Method and system for document image feature extraction |
US5999664A (en) * | 1997-11-14 | 1999-12-07 | Xerox Corporation | System for searching a corpus of document images by user specified document layout components |
US7194471B1 (en) * | 1998-04-10 | 2007-03-20 | Ricoh Company, Ltd. | Document classification system and method for classifying a document according to contents of the document |
US6243501B1 (en) * | 1998-05-20 | 2001-06-05 | Canon Kabushiki Kaisha | Adaptive recognition of documents using layout attributes |
US7039856B2 (en) * | 1998-09-30 | 2006-05-02 | Ricoh Co., Ltd. | Automatic document classification using text and images |
US6393150B1 (en) * | 1998-12-04 | 2002-05-21 | Eastman Kodak Company | Region-based image binarization system |
US6976207B1 (en) * | 1999-04-28 | 2005-12-13 | Ser Solutions, Inc. | Classification method and apparatus |
US20020022956A1 (en) * | 2000-05-25 | 2002-02-21 | Igor Ukrainczyk | System and method for automatically classifying text |
US6823331B1 (en) * | 2000-08-28 | 2004-11-23 | Entrust Limited | Concept identification system and method for use in reducing and/or representing text content of an electronic document |
US7190477B2 (en) * | 2001-02-22 | 2007-03-13 | Sharp Laboratories Of America, Inc. | System and method for managing and processing a print job using print job tickets |
US6943905B2 (en) * | 2001-12-20 | 2005-09-13 | Sharp Laboratories Of America, Inc. | Virtual print driver system and method |
US20030226100A1 (en) * | 2002-05-17 | 2003-12-04 | Xerox Corporation | Systems and methods for authoritativeness grading, estimation and sorting of documents in large heterogeneous document collections |
US6947933B2 (en) * | 2003-01-23 | 2005-09-20 | Verdasys, Inc. | Identifying similarities within large collections of unstructured data |
US20050060643A1 (en) * | 2003-08-25 | 2005-03-17 | Miavia, Inc. | Document similarity detection and classification system |
US20050117803A1 (en) * | 2003-11-28 | 2005-06-02 | Canon Kabushiki Kaisha | Document recognition device, document recognition method and program, and storage medium |
US20050244060A1 (en) * | 2004-04-30 | 2005-11-03 | Xerox Corporation | Reformatting binary image data to generate smaller compressed image data size |
US20060036649A1 (en) * | 2004-08-12 | 2006-02-16 | Simske Steven J | Index extraction from documents |
US20060190489A1 (en) * | 2005-02-23 | 2006-08-24 | Janet Vohariwatt | System and method for electronically processing document images |
US7623712B2 (en) * | 2005-06-09 | 2009-11-24 | Canon Kabushiki Kaisha | Image processing method and apparatus |
US7783117B2 (en) * | 2005-08-12 | 2010-08-24 | Seiko Epson Corporation | Systems and methods for generating background and foreground images for document compression |
US20070118391A1 (en) * | 2005-10-24 | 2007-05-24 | Capsilon Fsg, Inc. | Business Method Using The Automated Processing of Paper and Unstructured Electronic Documents |
US7747495B2 (en) * | 2005-10-24 | 2010-06-29 | Capsilon Corporation | Business method using the automated processing of paper and unstructured electronic documents |
US20070201764A1 (en) * | 2006-02-27 | 2007-08-30 | Samsung Electronics Co., Ltd. | Apparatus and method for detecting key caption from moving picture to provide customized broadcast service |
US20070203885A1 (en) * | 2006-02-28 | 2007-08-30 | Korea Advanced Institute Of Science & Technology | Document Classification Method, and Computer Readable Record Medium Having Program for Executing Document Classification Method By Computer |
US20070211964A1 (en) * | 2006-03-09 | 2007-09-13 | Gad Agam | Image-based indexing and classification in image databases |
US7787711B2 (en) * | 2006-03-09 | 2010-08-31 | Illinois Institute Of Technology | Image-based indexing and classification in image databases |
US20070247531A1 (en) * | 2006-04-19 | 2007-10-25 | Yining Deng | Method and system to reduce flicker artifacts in captured images |
US20080062472A1 (en) * | 2006-09-12 | 2008-03-13 | Morgan Stanley | Document handling |
US20090116757A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for classifying electronic documents by extracting and recognizing text and image features indicative of document categories |
US20090116746A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for parallel processing of document recognition and classification using extracted image and text features |
US20090116756A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for training a document classification system using documents from a plurality of users |
US20090116755A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for enabling manual classification of unrecognized documents to complete workflow for electronic jobs and to assist machine learning of a recognition system using automatically extracted features of unrecognized documents |
US20090119296A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for handling and distinguishing binarized, background artifacts in the vicinity of document text and image features indicative of a document category |
US7797260B2 (en) * | 2008-02-11 | 2010-09-14 | Yahoo! Inc. | Automated document classifier tuning including training set adaptive to user browsing behavior |
Cited By (52)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080077583A1 (en) * | 2006-09-22 | 2008-03-27 | Pluggd Inc. | Visual interface for identifying positions of interest within a sequentially ordered information encoding |
US8966389B2 (en) | 2006-09-22 | 2015-02-24 | Limelight Networks, Inc. | Visual interface for identifying positions of interest within a sequentially ordered information encoding |
US9015172B2 (en) | 2006-09-22 | 2015-04-21 | Limelight Networks, Inc. | Method and subsystem for searching media content within a content-search service system |
US20110282879A1 (en) * | 2006-09-22 | 2011-11-17 | Limelight Networks, Inc. | Method and subsystem for information acquisition and aggregation to facilitate ontology and language model generation within a content-search-service system |
US20090119296A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for handling and distinguishing binarized, background artifacts in the vicinity of document text and image features indicative of a document category |
US20090116746A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for parallel processing of document recognition and classification using extracted image and text features |
US20090116757A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for classifying electronic documents by extracting and recognizing text and image features indicative of document categories |
US20090116756A1 (en) * | 2007-11-06 | 2009-05-07 | Copanion, Inc. | Systems and methods for training a document classification system using documents from a plurality of users |
US8538184B2 (en) | 2007-11-06 | 2013-09-17 | Gruntworx, Llc | Systems and methods for handling and distinguishing binarized, background artifacts in the vicinity of document text and image features indicative of a document category |
US20110093258A1 (en) * | 2009-10-15 | 2011-04-21 | 2167959 Ontario Inc. | System and method for text cleaning |
WO2011044659A1 (en) * | 2009-10-15 | 2011-04-21 | 2167959 Ontario Inc. | System and method for phrase identification |
US20110093414A1 (en) * | 2009-10-15 | 2011-04-21 | 2167959 Ontario Inc. | System and method for phrase identification |
US8868469B2 (en) | 2009-10-15 | 2014-10-21 | Rogers Communications Inc. | System and method for phrase identification |
US8380492B2 (en) | 2009-10-15 | 2013-02-19 | Rogers Communications Inc. | System and method for text cleaning by classifying sentences using numerically represented features |
US20110135162A1 (en) * | 2009-12-09 | 2011-06-09 | Canon Kabushiki Kaisha | Image processing apparatus, image processing method and recording medium |
US8885931B2 (en) * | 2011-01-26 | 2014-11-11 | Microsoft Corporation | Mitigating use of machine solvable HIPs |
US20120189194A1 (en) * | 2011-01-26 | 2012-07-26 | Microsoft Corporation | Mitigating use of machine solvable hips |
US20130021340A1 (en) * | 2011-07-19 | 2013-01-24 | SAMSUNG ELECTRONICS CO., LTD. of Suwon-si Republic of Korea | Method and apparatus for obtaining lighting information and material information in image modeling system |
US9035964B2 (en) * | 2011-07-19 | 2015-05-19 | Samsung Electronics Co., Ltd. | Method and apparatus for obtaining lighting information and material information in image modeling system |
US12045244B1 (en) | 2011-11-02 | 2024-07-23 | Autoflie Inc. | System and method for automatic document management |
US10204143B1 (en) | 2011-11-02 | 2019-02-12 | Dub Software Group, Inc. | System and method for automatic document management |
US10185765B2 (en) * | 2012-09-06 | 2019-01-22 | Fuji Xerox Co., Ltd. | Non-transitory computer-readable medium, information classification method, and information processing apparatus |
US11238079B2 (en) | 2012-10-31 | 2022-02-01 | Open Text Corporation | Auto-classification system and method with dynamic user feedback |
US10685051B2 (en) | 2012-10-31 | 2020-06-16 | Open Text Corporation | Reconfigurable model for auto-classification system and method |
US12038959B2 (en) | 2012-10-31 | 2024-07-16 | Open Text Corporation | Reconfigurable model for auto-classification system and method |
TWI505207B (en) * | 2014-03-26 | 2015-10-21 | Excellence Inc E | System and Method of Automatically Delivering Electronic Official Documents |
CN105426462A (en) * | 2015-11-13 | 2016-03-23 | 深圳码隆科技有限公司 | Image searching method and device based on image element |
US10839302B2 (en) | 2015-11-24 | 2020-11-17 | The Research Foundation For The State University Of New York | Approximate value iteration with complex returns by bounding |
EP3196811A1 (en) * | 2016-01-20 | 2017-07-26 | Accenture Global Services Limited | Cognitive document reader |
US20170206409A1 (en) * | 2016-01-20 | 2017-07-20 | Accenture Global Solutions Limited | Cognitive document reader |
US11900274B2 (en) | 2016-09-22 | 2024-02-13 | nference, inc. | Systems, methods, and computer readable media for visualization of semantic information and inference of temporal signals indicating salient associations between life science entities |
US10884981B1 (en) | 2017-06-19 | 2021-01-05 | Wells Fargo Bank, N.A. | Tagging tool for managing data |
US11789903B1 (en) | 2017-06-19 | 2023-10-17 | Wells Fargo Bank, N.A. | Tagging tool for managing data |
US10546054B1 (en) * | 2018-02-28 | 2020-01-28 | Intuit Inc. | System and method for synthetic form image generation |
CN111383299A (en) * | 2018-12-28 | 2020-07-07 | Tcl集团股份有限公司 | Image processing method and device and computer readable storage medium |
JP7243286B2 (en) | 2019-02-25 | 2023-03-22 | コニカミノルタ株式会社 | Image forming device and document management system |
JP2020135736A (en) * | 2019-02-25 | 2020-08-31 | コニカミノルタ株式会社 | Image formation apparatus and document management system |
WO2020243846A1 (en) * | 2019-06-06 | 2020-12-10 | Bear Health Technologies Inc. | System and method for automated file reporting |
US11545242B2 (en) | 2019-06-21 | 2023-01-03 | nference, inc. | Systems and methods for computing with private healthcare data |
US11487902B2 (en) | 2019-06-21 | 2022-11-01 | nference, inc. | Systems and methods for computing with private healthcare data |
US11829514B2 (en) | 2019-06-21 | 2023-11-28 | nference, inc. | Systems and methods for computing with private healthcare data |
US11848082B2 (en) | 2019-06-21 | 2023-12-19 | nference, inc. | Systems and methods for computing with private healthcare data |
WO2021011776A1 (en) * | 2019-07-16 | 2021-01-21 | nference, inc. | Systems and methods for populating a structured database based on an image representation of a data table |
US12032546B2 (en) | 2019-07-16 | 2024-07-09 | nference, inc. | Systems and methods for populating a structured database based on an image representation of a data table |
US11556549B2 (en) | 2019-09-05 | 2023-01-17 | Yandex Europe Ag | Method and system for ranking plurality of digital documents |
US11335108B2 (en) | 2020-08-10 | 2022-05-17 | Marlabs Incorporated | System and method to recognise characters from an image |
US20220147843A1 (en) * | 2020-11-12 | 2022-05-12 | Samsung Electronics Co., Ltd. | On-device knowledge extraction from visually rich documents |
US11798258B2 (en) | 2021-05-03 | 2023-10-24 | Bank Of America Corporation | Automated categorization and assembly of low-quality images into electronic documents |
US12033367B2 (en) | 2021-05-03 | 2024-07-09 | Bank Of America Corporation | Automated categorization and assembly of low-quality images into electronic documents |
US11704352B2 (en) | 2021-05-03 | 2023-07-18 | Bank Of America Corporation | Automated categorization and assembly of low-quality images into electronic documents |
US11881041B2 (en) | 2021-09-02 | 2024-01-23 | Bank Of America Corporation | Automated categorization and processing of document images of varying degrees of quality |
WO2024020701A1 (en) * | 2022-07-28 | 2024-02-01 | Wisedocs Inc. | System and method for automated file reporting |
Also Published As
Publication number | Publication date |
---|---|
US20090116757A1 (en) | 2009-05-07 |
WO2009061917A1 (en) | 2009-05-14 |
US20090116746A1 (en) | 2009-05-07 |
US20090119296A1 (en) | 2009-05-07 |
US8538184B2 (en) | 2013-09-17 |
US20090116755A1 (en) | 2009-05-07 |
US20090116756A1 (en) | 2009-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8538184B2 (en) | Systems and methods for handling and distinguishing binarized, background artifacts in the vicinity of document text and image features indicative of a document category | |
US8897563B1 (en) | Systems and methods for automatically processing electronic documents | |
US11188713B2 (en) | Extracting structured information from a document containing filled form images | |
US20110249905A1 (en) | Systems and methods for automatically extracting data from electronic documents including tables | |
AU2020200251B2 (en) | Label and field identification without optical character recognition (OCR) | |
JP5073022B2 (en) | Low resolution OCR for documents acquired with a camera | |
CN103995904B (en) | A kind of identifying system of image file electronic bits of data | |
US20210149931A1 (en) | Scalable form matching | |
Christy et al. | Mass digitization of early modern texts with optical character recognition | |
Mariner | Optical Character Recognition (OCR) | |
Barrett et al. | Digital mountain: From granite archive to global access | |
Mehri | Historical document image analysis: a structural approach based on texture | |
Bangera et al. | Digitization Of Tulu Handwritten Scripts-A Literature Survey | |
Hussain et al. | Deep learning-based recognition system for pashto handwritten text: benchmark on PHTI | |
Campbell | Computational Analysis of Documents | |
Semertzidis et al. | Social Media: Trends, Events, and Influential Users | |
Mapari et al. | A Study Of Devnagri Handwritten Character Recognition System | |
Gupta et al. | Automated transfer of information from paper documents to computer-accessible media | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: COPANION, INC., MASSACHUSETTS Free format text: PROPRIETARY INFORMATION AND INVENTIONS AGREEMENT;ASSIGNORS:NEOGI, DEPANKAR;LADD, STEVEN;AHMED, DILNAWAJ;AND OTHERS;SIGNING DATES FROM 20070219 TO 20110727;REEL/FRAME:027581/0585 |
| AS | Assignment | Owner name: GRUNTWORX, LLC, NORTH CAROLINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:COPANION, INC.;REEL/FRAME:027685/0352 Effective date: 20110707 |
| AS | Assignment | Owner name: GRUNTWORX, LLC, NORTH CAROLINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:COPANION, INC.;REEL/FRAME:028157/0982 Effective date: 20110727 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |