Java Library and Tool to Index and search PDF files using Apache Lucene and PDF Box
Maven dependency
<!-- Java Library and Tool to Index and search PDF files using Apache Lucene and PDF Box https://www.bitplan.com/PdfIndexer -->
<dependency>
<groupId>com.bitplan.pdfindex</groupId>
<artifactId>com.bitplan.pdfindex</artifactId>
<version>0.0.11</version>
</dependency>
Current release at repo1.maven.org
git clone https://github.com/WolfgangFahl/pdfindexer
cd pdfindexer
mvn install
Index and search for keywords in PDF sources (files and URLs) using Apache Lucene and PDFBox The result will be put in a HTML file - the layout can be modified using a Freemarker template
- The approach from https://stackoverflow.com/questions/14013644/hosting-a-maven-repository-on-github is used
see test folder for example input and results see Usage below for how to run pdfindexer from command line
-
Source: Lorem Ipsum PDF
-
Keywords: https://github.com/WolfgangFahl/pdfindexer/blob/master/test/searchwords.txt
-
Result: Lorem Ipsum PDF Index
java -jar pdfindex.jar --sourceFileList test/pdffiles.lst --idxfile test/index2 --outputfile test/html/pdfindex.html --searchKeyWordList test/searchwords.txt --root test/
resulting html file is in test/html/pdfindex.html
PDF text from the University of Notthingham about how to publish journals using the brand new Adobe technology (written 1993)
- Source: https://eprints.nottingham.ac.uk/249/1/cajun.pdf
- Keywords: Adobe IBM MS-DOS
- Result: Cajun PDF Index
java -jar pdfindexer.jar [options]
see usage page below
Pdfindexer Version: 0.0.9
github: https://github.com/WolfgangFahl/pdfindexer.git
usage: java com.bitplan.pdfindexer.Pdfindexer
--title VAL : title to be used in html result
-d (--debug) : debug
create additional debug output if this switch
is used
-e (--autoescape) : autoescape blanks
set to off if you'd like to use lucene query
syntax
-f (--src) VAL : source url, directory/or file
-h (--help) : help
show this usage
-i (--idxfile) VAL : index file
-k (--keyWords) VAL : search
comma separated list of keywords to search
-l (--sourceFileList) VAL : path to ascii-file with source urls,directories
or file names
one url/file/directory may be specified by line
-m (--maxHits) N : maximum number of hits per keyword
-o (--outputfile) VAL : (html) output file
the output file will contain the search result
with links to the pages in the pdf files that
haven been searched
-p (--templatePath) VAL : path to Freemarker template file(s) to be used
to format the output
-r (--root) VAL : root
if a root is specified the paths in the
sourceFileList and in the output will be
considered relative to this root path
-s (--silent) : stay silent
do not create any output on System.out if this
switch is used
-t (--templateName) VAL : name of Freemarker template to be used
-v (--version) : showVersion
show current version if this switch is used
-x (--extract) : extract text
extract text content to files
-w (--searchKeyWordList) VAL : file with search words
src/main/resources/templates
contains the default freemarker template "defaultindex.ftl". You might want to modify it our create your own template and use the -t/--templateName option to use it.
Version | date | changes |
---|---|---|
0.0.3 | 2013 | first published version |
0.0.4 | 2013 | adds text extract feature |
0.0.5 | 2014-05-31 | fixes template - fixes this README - allows positional command line arguments |
0.0.6 | 2014-08-18 | fixes bug - adds Apache License to README - adds github as maven repository |
0.0.7 | 2015-02-13 | upgrade to apache pdfbox 1.8.4 to avoid bug https://issues.apache.org/jira/browse/PDFBOX-1541 |
0.0.8 | 2015-02-14 | switch to NonSeq parser and upgrade to apache pdfbox 1.8.8 to avoid bugs https://issues.apache.org/jira/browse/PDFBOX-1845 and https://issues.apache.org/jira/browse/PDFBOX-2523 a |
0.0.9 | 2015-02-14 | and for good measure apache pdfbox 1.8.9 to avoid https://issues.apache.org/jira/browse/PDFBOX-2579 |
0.0.10 | 2017-04-28 | upgrades pdfbox to 1.8.13 |
0.0.11 | 2018-08-22 | fixes #4, fixes #5, upgrades Java to Java8 uses com.bitplan.pom |