Collections-OCR

This repo is archived as of March 2023. The project is not active, and the Google Cloud API and the cloudyr R wrapper have changed since then. For more information, contact Todd Widhelm.

A few scripts that run batches of collections label images through OCR.

Google Cloud Vision API & `ocrCloudVision.R`

Dependencies

Make sure to install these libraries first:

googleCloudVisionR - to do OCR magic
Other dependencies include readr, tidyr, stringr for data handling
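
A quick way to check for these packages before running the script is sketched below. It assumes all four are installable from CRAN for your R version, which may no longer hold now that the repo is archived. `pkgs` and `missing_pkgs` are illustrative names, not taken from the script.

```r
# Packages named in this README (CRAN availability is an assumption).
pkgs <- c("googleCloudVisionR", "readr", "tidyr", "stringr")

# Find which ones are not yet installed in the current library paths.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them:
  # install.packages(missing_pkgs)
} else {
  message("All dependencies are installed.")
}
```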

How to run `ocrCloudVision.R`:

Notes:

This currently uses Google's Cloud Vision API, which requires:
- Being aware of pricing & quotas for the Google Vision API
- Setting up a project on Google Cloud Platform
- Authenticating your machine by setting up a service account & key
  - Get help from the cloudyr repo for googleCloudVisionR
This can take over 30 seconds per label-image.
- Be mindful of how many images you add to your "images" directory.
- Be mindful of your internet connection speed.
- Keep image sizes under 20 MB (overall, smaller image files transfer and process more quickly).
Output likely needs some [or many] follow-up/clean-up steps.
- Batch similar images together to streamline follow-up steps.
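
The size and batch-count notes above can be checked before any image is sent to the API. The following is a minimal sketch using base R only; `check_images()` is a hypothetical helper that is not part of `ocrCloudVision.R`, and it treats 1 MB as 1024^2 bytes, which is an assumption about how the limit is measured.

```r
# Summarise the JPG/JPEG files in a folder and flag any at or above a size limit.
# Assumption: 1 MB is counted as 1024^2 bytes.
check_images <- function(dir = "images", max_mb = 20) {
  paths <- list.files(dir, pattern = "\\.jpe?g$", ignore.case = TRUE,
                      full.names = TRUE)
  size_mb <- file.info(paths)$size / 1024^2
  data.frame(image = basename(paths),
             imagesize = round(size_mb, 2),
             too_large = size_mb >= max_mb,
             stringsAsFactors = FALSE)
}

report <- check_images("images")
message(nrow(report), " image(s) found; ",
        sum(report$too_large), " at or over the size limit")
```

Running this first gives a rough idea of how long a batch may take and which files to resize.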

To run the script:

Add a folder named "images" to this script's directory
Add the images (JPG & JPEG) you'd like to OCR to that directory
Run the script (Rscript ocrCloudVision.R)

Output from `ocrCloudVision.R`:

A CSV named "ocrText-[Date-time].csv", containing these columns:

"image" = filename for each JPG and JPEG
"imagesize" = filesize for each image (in MB)
"ocr_start" = start-date and time when an image was submitted to the Google Vision API
"ocr_duration" = duration (in seconds) of the OCR process
"line_count" = number of lines in each OCR transcription
"Line1" - "Line[N]" = text for each line in the OCR transcription of an image.
- the number of "Line" columns will match the maximum number of lines as needed.
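
To illustrate the column layout described above, here is a small base-R sketch that spreads multi-line text into "Line1" to "Line[N]" columns. The input text is made up, the script itself may use stringr and tidyr instead, and the date-time format in the file name is an assumption.

```r
# Hypothetical OCR text for two label images (not real output).
ocr_text <- c(label1.jpg = "Field Museum\nHerbarium\nIllinois",
              label2.jpg = "Flora of Peru")

# Split each transcription into lines and count them.
lines <- strsplit(ocr_text, "\n", fixed = TRUE)
line_count <- lengths(lines)
max_lines <- max(line_count)

# Indexing past the end of a vector yields NA, which pads short transcriptions
# so that every row has max_lines cells.
padded <- do.call(rbind, lapply(lines, function(x) x[seq_len(max_lines)]))
colnames(padded) <- paste0("Line", seq_len(max_lines))

out <- data.frame(image = names(ocr_text), line_count = line_count,
                  padded, stringsAsFactors = FALSE, row.names = NULL)

# Timestamped file name in the spirit of "ocrText-[Date-time].csv";
# the exact date-time format used by the script is an assumption.
out_file <- paste0("ocrText-", format(Sys.time(), "%Y%m%d-%H%M%S"), ".csv")
# write.csv(out, out_file, row.names = FALSE, na = "")
```

Images with fewer lines than the longest transcription end up with empty trailing "Line" cells.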

Tesseract & `ocrMangle.R`

Dependencies

Make sure to install these libraries first:

magick - to read in image files
tesseract - to do OCR magic
stringr - to split the OCR'ed lines to columns

How to run `ocrMangle.R`:

Notes:

This can take over 10 seconds per label-image.
- Be mindful of how many images you add to your "images" directory.
This currently uses Tesseract's English ("eng"), German ("deu"), and Latin ("lat") libraries.
Output likely needs some [or many] follow-up/clean-up steps.
- Batch similar images together to streamline follow-up steps.
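
As a rough sketch of how the three language models might be combined, the helper below builds a single Tesseract engine from the language codes and runs it on one image. `lang_spec()` and `ocr_label()` are hypothetical names; whether `ocrMangle.R` combines languages this way is not stated here, and passing a "+"-joined string to `tesseract::tesseract()` is an assumption worth checking against the package documentation.

```r
# Join language codes into the "eng+deu+lat" form that Tesseract accepts.
lang_spec <- function(langs = c("eng", "deu", "lat")) {
  paste(langs, collapse = "+")
}

# OCR one image and return its text split into lines.
# Requires the magick and tesseract packages plus the language data files.
ocr_label <- function(path, langs = c("eng", "deu", "lat")) {
  # Download any language data that is not available locally.
  for (lang in setdiff(langs, tesseract::tesseract_info()$available)) {
    tesseract::tesseract_download(lang)
  }
  engine <- tesseract::tesseract(language = lang_spec(langs))
  text <- tesseract::ocr(magick::image_read(path), engine = engine)
  strsplit(text, "\n", fixed = TRUE)[[1]]
}

# Example call (hypothetical file name):
# ocr_label("images/label1.jpg")
```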

To run the script:

Add a folder named "images" to this script's directory
Add the images (JPG & JPEG) you'd like to OCR to that directory
Run the script (Rscript ocrMangle.R)

Output from `ocrMangle.R`:

A CSV named "ocrText-[Date-time].csv", containing these columns:

"image" = filename for each JPG and JPEG
"line_count" = number of lines in each OCR transcription
"Line1" - "Line[N]" = text for each line in the OCR transcription.
- the number of "Line" columns will match the maximum number of lines as needed.
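
Since the output usually needs clean-up, one possible first step is to stack the "Line" columns into one row per image per line, trim whitespace, and drop the empty padding cells. This is a base-R sketch on made-up rows; in practice the data frame would come from `read.csv()` on the generated file.

```r
# Hypothetical rows shaped like the CSV described above (not real output).
ocr <- data.frame(image = c("label1.jpg", "label2.jpg"),
                  line_count = c(2L, 1L),
                  Line1 = c("  Field Museum ", "Flora of Peru"),
                  Line2 = c("Herbarium", NA),
                  stringsAsFactors = FALSE)

line_cols <- grep("^Line[0-9]+$", names(ocr), value = TRUE)
line_nos <- as.integer(sub("^Line", "", line_cols))

# Stack the Line columns: unlist() goes column by column, so image names
# repeat once per column and each line number repeats once per image.
long <- data.frame(image = rep(ocr$image, times = length(line_cols)),
                   line_no = rep(line_nos, each = nrow(ocr)),
                   text = trimws(unlist(ocr[line_cols], use.names = FALSE)),
                   stringsAsFactors = FALSE)

# Drop the padding cells and restore image-then-line order.
long <- long[!is.na(long$text) & long$text != "", ]
long <- long[order(long$image, long$line_no), ]
rownames(long) <- NULL
```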

Google Drive API & `ocrGoogleDrive.R`

This script is a rough draft; it might work for small batches, but it needs more work.
