Wikisource Google OCR tool

Main documentation: https://wikisource.org/wiki/Wikisource:Google_OCR

This is a simple wrapper service around the Google Cloud Vision API, enabling Wikisources to submit images for Optical Character Recognition and retrieve the resultant text.

This works with more languages than the alternative service at https://tools.wmflabs.org/phetools (used by e.g. https://wikisource.org/wiki/MediaWiki:OCR.js and similar scripts on other Wikisources).

Requests can only be for images hosted on Commons.
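
A client can check an image URL before sending it. The sketch below assumes that Commons files are served over HTTPS from upload.wikimedia.org under the /wikipedia/commons/ path. The service's own validation may differ, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: Commons-hosted files live under this host and path prefix.
# The service's own server-side check may be stricter or looser.
COMMONS_HOST = "upload.wikimedia.org"
COMMONS_PATH_PREFIX = "/wikipedia/commons/"


def looks_like_commons_image(image_url: str) -> bool:
    """Return True if the URL appears to point at a file hosted on Commons."""
    parts = urlsplit(image_url)
    return (
        parts.scheme == "https"
        and parts.hostname == COMMONS_HOST
        and parts.path.startswith(COMMONS_PATH_PREFIX)
    )
```

This is only a convenience pre-check; the response from api.php remains the authority on whether an image is accepted.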

Usage

Send up to two parameters to api.php: langs[] (optional, may be repeated) and image:

https://example.org/api.php?langs[]=[LANG_CODE_1]&langs[]=[LANG_CODE_2]&image=[IMAGE_URL]
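
As a minimal sketch, a request URL of this shape can be built with Python's standard library. The base URL is the placeholder from the example above, and the function name and the sample image URL are hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint taken from the example above; replace with the real host.
API_URL = "https://example.org/api.php"


def build_ocr_url(image_url: str, langs: tuple[str, ...] = ()) -> str:
    """Build the query URL: one langs[] entry per language, then the image."""
    # A list of pairs lets the langs[] key repeat once per language code.
    params = [("langs[]", lang) for lang in langs]
    params.append(("image", image_url))
    # urlencode percent-encodes the brackets (langs%5B%5D), which decodes
    # back to langs[] on the server side.
    return f"{API_URL}?{urlencode(params)}"


url = build_ocr_url(
    "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
    langs=("en", "fr"),
)
```

Leaving langs empty sends only the image parameter.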

And get back a JSON response with either 'text' or 'error' top-level items set:

{
  "text": "Lorem ipsum...",
  "error": {
    "code": "",
    "message": ""
  }
}
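
A caller might handle such a response as sketched below. This assumes that an 'error' item counts as set only when its code or message is non-empty; the exception class and function name are hypothetical and not part of the service.

```python
import json


class OcrError(Exception):
    """Raised when the response carries a populated 'error' item."""


def extract_text(body: str) -> str:
    """Return the recognised text, or raise OcrError if an error is reported."""
    data = json.loads(body)
    error = data.get("error")
    # Assumption: an error object with empty code and message means "no error".
    if isinstance(error, dict) and (error.get("code") or error.get("message")):
        raise OcrError(f"{error.get('code', '')}: {error.get('message', '')}")
    return data.get("text", "")
```

Raising an exception keeps the error path separate from the text path, so callers cannot mistake an error message for OCR output.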

Languages

Google

Note that you should only set the langs[] parameter for languages that require it. The Google Cloud Vision documentation says:

In most cases, an empty value yields the best results since it enables automatic language detection. For languages based on the Latin alphabet, setting languageHints is not needed. In rare cases, when the language of the text in the image is known, setting a hint will help get better results (although it will be a significant hindrance if the hint is wrong). Text detection returns an error if one or more of the specified languages is not one of the supported languages.
