Generate searchable pdf documents from scanned documents with Amazon Textract

aws-samples/amazon-textract-searchable-pdf

Generate Searchable PDF documents with Amazon Textract

This repository contains a sample library and code examples showing how Amazon Textract can be used to extract text from documents and generate searchable PDF documents.

How a searchable PDF is generated

To generate a searchable PDF, we use Amazon Textract to extract text from documents and then add the extracted text as a layer over the image in the PDF document. Amazon Textract detects and analyzes text in input documents and returns information about detected items such as pages, words, lines, form data (key-value pairs), tables, and selection elements. It also provides bounding box information, an axis-aligned coarse representation of where each recognized item sits on the document page. We use the detected text and its bounding box to place the text at the matching position on the PDF page.
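The placement step can be sketched in plain Java. Amazon Textract reports each bounding box as Left, Top, Width, and Height ratios of the page size, measured from the top-left corner, while PDF user space by default has its origin at the bottom-left corner. The sketch below converts one to the other. The names `BoundingBoxMapper`, `PdfRect`, and `toPdfRect` are hypothetical and are not taken from the PDFDocument library, which may handle this differently.

```java
// Hypothetical helper for illustration only; not part of the sample library.
final class BoundingBoxMapper {

    /** Rectangle in PDF user space: origin at the bottom-left corner, units in points. */
    record PdfRect(float x, float y, float width, float height) {}

    /**
     * Converts a Textract-style bounding box (ratios of the page size, origin at
     * the top-left corner) into a rectangle in PDF user space.
     */
    static PdfRect toPdfRect(float left, float top, float width, float height,
                             float pageWidth, float pageHeight) {
        float x = left * pageWidth;
        float w = width * pageWidth;
        float h = height * pageHeight;
        // Flip the vertical axis: the bottom edge of the box is (top + height)
        // down from the top of the page, so measure it up from the bottom instead.
        float y = pageHeight - (top + height) * pageHeight;
        return new PdfRect(x, y, w, h);
    }

    public static void main(String[] args) {
        // A box starting a quarter of the way across and halfway down a 600 x 800 page.
        PdfRect rect = toPdfRect(0.25f, 0.5f, 0.5f, 0.25f, 600f, 800f);
        System.out.println(rect);
    }
}
```

The returned `y` is the bottom edge of the box, which is a reasonable starting point for a text baseline when drawing the invisible text layer.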

SampleInput.pdf is an example input document where text is locked inside the image. SampleOutput.pdf is an example of a searchable pdf document where you can select and copy text and search within the document.

The PDFDocument library wraps all the logic needed to generate a searchable PDF document from the output of Amazon Textract. It uses the open source Java library Apache PDFBox to create the PDF document, but similar PDF processing libraries are available in other programming languages.

    ...
    
    //Extract text using Amazon Textract
    List<TextLine> lines = extractText(imageBytes);
        
    //Create new pdf document
    PDFDocument pdfDocument = new PDFDocument();

    //Add page with text layer and image in the pdf document
    pdfDocument.addPage(image, imageType, lines);
    
    //Save PDF to local disk
    try(OutputStream outputStream = new FileOutputStream(outputDocumentName)) {
        pdfDocument.save(outputStream);
        pdfDocument.close();
    }

Code examples

Sample project has five different examples:

Run code examples on local machine

  1. Set up an AWS account and the AWS CLI by following Getting Started with Amazon Textract.
  2. Download and unzip the sample project.
  3. Install Apache Maven if it is not already installed.
  4. In the project directory run "mvn package".
  5. Run: "java -cp target/searchable-pdf-1.0.jar Demo" to run Java project with Demo as main class.

By default, only the first example, which creates a searchable PDF from an image on the local drive, is enabled. Uncomment the relevant lines in Demo to run the other examples.

Run code examples in AWS Lambda

  1. Download and unzip the sample project.
  2. Install Apache Maven if it is not already installed.
  3. In the project directory run "mvn package".

The build creates a .jar in project-dir/target/searchable-pdf-1.0.jar, using information in the pom.xml to do the necessary transforms. This is a standalone .jar (.zip file) that includes all the dependencies. It is the deployment package that you upload to AWS Lambda to create a Lambda function. DemoLambda has all the necessary code to read S3 events and take action based on the type of input document.

  1. Create an Amazon S3 bucket.

  2. Create a folder “documents” in the Amazon S3 bucket.

  3. Create an AWS Lambda function with the Java 17 runtime and an IAM role that has read and write permissions on the S3 bucket you created earlier.

  4. Configure the IAM role to have permissions to call Amazon Textract.

  5. Set the handler to "DemoLambda::handleRequest".

  6. Increase the timeout to 5 minutes.

  7. Upload the jar file you built earlier.

  8. Add a trigger to the Lambda function so that it runs whenever an object is uploaded to the “documents” folder in your Amazon S3 bucket.

Make sure that you set the trigger for the “documents” folder only. If you add the trigger for the whole bucket, the Lambda function will run every time an output PDF document is generated, and each run will trigger another, creating an endless loop.

  1. Upload an image (JPEG, PNG) or PDF document to the “documents” folder in your Amazon S3 bucket.

In a few seconds you should see a searchable PDF document generated in the S3 bucket.
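As an extra safeguard against the trigger loop described above, the handler itself can check the object key before doing any work. The following is a minimal sketch of such a check combined with a choice of action based on the file extension. The names `UploadRouter`, `Action`, and `route` are hypothetical, and the logic is an assumption for illustration; it is not taken from DemoLambda, which may decide differently.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the sample project.
final class UploadRouter {

    enum Action { PROCESS_IMAGE, PROCESS_PDF, SKIP }

    // Assumed input prefix, matching the "documents" folder created in the steps above.
    static final String INPUT_PREFIX = "documents/";

    /** Decides what to do with an uploaded object based on its S3 key. */
    static Action route(String key) {
        // Ignore anything outside the input folder, including generated output,
        // so that writing a result never causes another round of processing.
        if (key == null || !key.startsWith(INPUT_PREFIX)
                || key.length() == INPUT_PREFIX.length()) {
            return Action.SKIP;
        }
        String lower = key.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg") || lower.endsWith(".png")) {
            return Action.PROCESS_IMAGE;
        }
        if (lower.endsWith(".pdf")) {
            return Action.PROCESS_PDF;
        }
        return Action.SKIP;
    }

    public static void main(String[] args) {
        System.out.println(route("documents/scan.png"));
        System.out.println(route("output/scan-searchable.pdf"));
    }
}
```

A check like this only helps if the output is written outside the “documents” folder; the trigger prefix configured on the bucket remains the primary protection.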

These steps show a simple Amazon S3 and Lambda integration. In production you should consider a scalable architecture similar to this reference architecture.

Cost

  • As you run these samples they call different Amazon Textract APIs in your AWS account. You will get charged for all the API calls made as part of the analysis.

Other Resources

License

This library is licensed under the MIT-0 License. See the LICENSE file.
