A tool for making full e-books from pages digitized by Czech National Library

nextghost/erben

Erben

A web-based tool for creating full e-books from single pages digitized by the Czech National Library.

How it should work

First, the system will import the entire book catalog from the digitized archives of the Czech National Library (CzNL). The CzNL metadata archive is available over OAI-PMH at https://kramerius.nkp.cz/kramerius/oai (see https://www.openarchives.org/OAI/openarchivesprotocol.html for the protocol specification).
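The harvesting step could be sketched as follows. This is a minimal illustration in Python, not Erben's PHP code: it builds a `ListRecords` request for the endpoint above and extracts record identifiers plus the resumption token from a response. The verb, the `oai_dc` metadata prefix, and the XML namespace come from the OAI-PMH 2.0 specification; the function names and the sample response are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://kramerius.nkp.cz/kramerius/oai"  # endpoint named in this README
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}  # OAI-PMH 2.0 namespace


def list_records_url(metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # Per the protocol, resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return OAI_ENDPOINT + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (list of record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty resumptionToken element marks the last chunk of the list.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Made-up response used only to exercise the parser; not real CzNL data.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>example:1</identifier></header></record>
    <record><header><identifier>example:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would fetch `list_records_url()`, parse the response, and keep requesting `list_records_url(resumption_token=token)` until no token is returned.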

Next, an admin will select a batch of books to work on. The system will download their contents and convert the DjVu images of digitized pages to PNG so they can be viewed in a web browser without any additional plugins. Storing the contents of all books in PNG format would require nearly 20 TB of storage space, so only a handful of books will be open for work at any time. When a book is finished, its PNG images will be deleted to free space for other books, but the full history of text data will be kept.
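The convert-then-purge cycle described above might look roughly like this. It is a sketch in Python under several assumptions: ImageMagick's `convert` command is on the PATH (newer ImageMagick releases also ship it as `magick`), the local build can read DjVu files, and the directory layout and function names are hypothetical.

```python
import subprocess
from pathlib import Path


def png_path_for(djvu_path, out_dir):
    """Map a DjVu page file to its PNG name inside out_dir."""
    return Path(out_dir) / (Path(djvu_path).stem + ".png")


def convert_command(djvu_path, out_dir):
    """Return the ImageMagick argv for converting one page to PNG."""
    return ["convert", str(djvu_path), str(png_path_for(djvu_path, out_dir))]


def convert_page(djvu_path, out_dir):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(convert_command(djvu_path, out_dir), check=True)
    return png_path_for(djvu_path, out_dir)


def purge_pngs(out_dir):
    """Delete the PNG files of a finished book and return how many were removed.

    Only images are touched; text history would live elsewhere (the database).
    """
    removed = 0
    for png in Path(out_dir).glob("*.png"):
        png.unlink()
        removed += 1
    return removed
```

Keeping the PNG cache in one directory per book makes freeing space a matter of purging that directory once the book is marked finished.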

Users will then edit the OCR text wiki-style to correct mistakes, add basic formatting markup, and vote for pages that they believe are clean and finished.
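One possible way to model wiki-style editing with votes is sketched below in Python. The README does not define the data model, so everything here is an assumption: the `Page` class, the rule that a new edit resets earlier votes, and the default threshold of two votes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    number: int
    # Revisions are appended and never overwritten, so history is preserved.
    revisions: list = field(default_factory=list)   # (user, text), oldest first
    clean_votes: set = field(default_factory=set)   # users who voted "clean"

    def edit(self, user, text):
        """Store a new revision. Assumption: old votes do not carry over."""
        self.revisions.append((user, text))
        self.clean_votes.clear()

    def vote_clean(self, user):
        """Record one vote per user; repeated votes are ignored by the set."""
        self.clean_votes.add(user)

    @property
    def text(self):
        """Text of the latest revision, or an empty string if none exists."""
        return self.revisions[-1][1] if self.revisions else ""

    def is_finished(self, votes_needed=2):
        """Assumed rule: a page is finished once enough users voted it clean."""
        return len(self.clean_votes) >= votes_needed
```

For example, after `page.edit("ocr", raw_text)` and `page.edit("alice", fixed_text)`, two different users calling `vote_clean` would mark the page as finished under the assumed threshold.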

As the last step, the cleaned text of all pages in each book will be combined and exported for advanced formatting and conversion into an actual e-book format (PDF, EPUB, etc.).
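Combining the pages can be as simple as the following Python sketch. It assumes page texts are keyed by page number and that a blank line is an acceptable separator; both are illustrative choices, and `combine_pages` is a hypothetical name.

```python
def combine_pages(pages):
    """Join cleaned page texts in page-number order.

    pages: mapping of page number -> cleaned text.
    Empty pages (e.g. blank scans) are skipped.
    """
    ordered = [pages[number].strip() for number in sorted(pages)]
    return "\n\n".join(text for text in ordered if text)
```

The result is plain text meant as input for later formatting tools, not a finished e-book.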

System requirements

  • Linux/BSD
  • PostgreSQL database
  • ImageMagick
  • PHP 5.3 or newer with the following extensions:
    • curl
    • date
    • dom
    • libxml
    • pcntl (command line only, not listed as module in phpinfo() page)
    • pcre
    • PDO
    • pdo_pgsql
    • posix
    • xml
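To verify the extension list above before installing, the output of `php -m` (which prints the modules available to the command-line PHP binary, including pcntl) can be compared against the requirements. The Python sketch below is only a convenience check; the helper names are hypothetical, and it assumes `php` is on the PATH.

```python
import subprocess

# Extension names as listed in the requirements above.
REQUIRED = ["curl", "date", "dom", "libxml", "pcntl",
            "pcre", "PDO", "pdo_pgsql", "posix", "xml"]


def missing_extensions(php_m_output, required=REQUIRED):
    """Return required extensions that do not appear in `php -m` output."""
    loaded = {
        line.strip().lower()
        for line in php_m_output.splitlines()
        # Skip blank lines and section headers such as "[PHP Modules]".
        if line.strip() and not line.startswith("[")
    }
    return [name for name in required if name.lower() not in loaded]


def check_local_php():
    """Run `php -m` on this machine and report what is missing."""
    result = subprocess.run(["php", "-m"], capture_output=True,
                            text=True, check=True)
    return missing_extensions(result.stdout)
```

Because the web server may load a different PHP configuration than the command line, the web-facing extensions should also be confirmed on a phpinfo() page.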

Installation instructions

  1. Write the database connection details into config/config.php.example and rename it to config/config.php.
  2. Run install.php on your web server.
  3. (Testing only) Run testjobs.php from the command line to register the Czech National Library OAI repository in Erben and create a small import job.
  4. (Testing only) Run worker.php from the command line to execute the job and import a few book titles for testing (metadata only, no page content). The script will return immediately, but the job will keep running in a background process for a few minutes.

TODO

  • Implement background task processor
    • Harvest job generator
    • Book harvester
    • Page import job generator
    • Single page importer
  • Implement web interface
    • Account and session management
    • List of authors and books
    • Book detail page
    • Book page editor
    • Book page image display
    • User wishlist and book popularity statistics
    • User contribution leaderboard
    • System administration
      • Job and worker process management
      • Repository and harvesting management
      • Book content import management
    • Duplicate author entry merging tool
