pdfindexer

Java library and command line tool to index and search PDF files using Apache Lucene and PDFBox

Documentation

Maven dependency

<!-- Java Library and Tool to Index and search PDF files using Apache Lucene and PDF Box https://www.bitplan.com/PdfIndexer -->
<dependency>
  <groupId>com.bitplan.pdfindex</groupId>
  <artifactId>com.bitplan.pdfindex</artifactId>
  <version>0.0.11</version>
</dependency>

Current release at repo1.maven.org

How to build

git clone https://github.com/WolfgangFahl/pdfindexer
cd pdfindexer
mvn install

Purpose

Index and search for keywords in PDF sources (files and URLs) using Apache Lucene and PDFBox. The result is written to an HTML file. Its layout can be modified using a Freemarker template.

Integration into development environment

The approach from https://stackoverflow.com/questions/14013644/hosting-a-maven-repository-on-github is used to host the Maven repository on GitHub.

Examples

See the test folder for example input and results. See Usage below for how to run pdfindexer from the command line.

Lorem Ipsum

Source: Lorem Ipsum PDF
Keywords: https://github.com/WolfgangFahl/pdfindexer/blob/master/test/searchwords.txt
Result: Lorem Ipsum PDF Index

java -jar pdfindex.jar --sourceFileList test/pdffiles.lst --idxfile test/index2 --outputfile test/html/pdfindex.html --searchKeyWordList test/searchwords.txt --root test/

The resulting HTML file is written to test/html/pdfindex.html.
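
To start the indexer from another Java program, the command above can be assembled as an argument list and launched with ProcessBuilder. The following is a minimal sketch and not part of pdfindexer: the class PdfIndexCommand and its methods are hypothetical names chosen for illustration, while the option names and paths are copied from the example command. It assumes pdfindex.jar is located in the working directory and only launches the tool if that file exists.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of pdfindexer) that assembles the example command line. */
class PdfIndexCommand {

  /** Builds the argument list; the option names are taken from the usage page. */
  static List<String> build(String jar, String sourceFileList, String idxFile,
      String outputFile, String searchKeyWordList, String root) {
    List<String> cmd = new ArrayList<>();
    cmd.add("java");
    cmd.add("-jar");
    cmd.add(jar);
    cmd.add("--sourceFileList");
    cmd.add(sourceFileList);
    cmd.add("--idxfile");
    cmd.add(idxFile);
    cmd.add("--outputfile");
    cmd.add(outputFile);
    cmd.add("--searchKeyWordList");
    cmd.add(searchKeyWordList);
    cmd.add("--root");
    cmd.add(root);
    return cmd;
  }

  /** Runs the command, forwards its output to this process and returns the exit code. */
  static int run(List<String> cmd) throws IOException, InterruptedException {
    return new ProcessBuilder(cmd).inheritIO().start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    List<String> cmd = build("pdfindex.jar", "test/pdffiles.lst", "test/index2",
        "test/html/pdfindex.html", "test/searchwords.txt", "test/");
    // Show the command line that would be executed.
    System.out.println(String.join(" ", cmd));
    // Only launch the tool if the jar is actually present (assumption: working directory).
    if (new File("pdfindex.jar").isFile()) {
      System.out.println("exit code: " + run(cmd));
    }
  }
}
```

Keeping the arguments in a list instead of a single string avoids quoting problems when paths contain blanks.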

Cajun project

A PDF paper from the University of Nottingham, written in 1993, about how to publish journals using the then brand-new Adobe technology.

Source: https://eprints.nottingham.ac.uk/249/1/cajun.pdf
Keywords: Adobe IBM MS-DOS
Result: Cajun PDF Index

Usage

Directly from jar

java -jar pdfindexer.jar [options]

See the usage page below for the available options.

Usage page

	Pdfindexer Version: 0.0.9
	
	 github: https://github.com/WolfgangFahl/pdfindexer.git
	
	  usage: java com.bitplan.pdfindexer.Pdfindexer
	 --title VAL                  : title to be used in html result
	 -d (--debug)                 : debug
	                                create additional debug output if this switch
	                                is used
	 -e (--autoescape)            : autoescape blanks
	                                set to off if you'd like to use lucene query
	                                syntax
	 -f (--src) VAL               : source url, directory/or file
	 -h (--help)                  : help
	                                show this usage
	 -i (--idxfile) VAL           : index file
	 -k (--keyWords) VAL          : search
	                                comma separated list of keywords to search
	 -l (--sourceFileList) VAL    : path to ascii-file with source urls,directories
	                                or file names
	                                one url/file/directory may be specified by line
	 -m (--maxHits) N             : maximum number of hits per keyword
	 -o (--outputfile) VAL        : (html) output file
	                                the output file will contain the search result
	                                with links to the pages in the pdf files that
	                                haven been searched
	 -p (--templatePath) VAL      : path to Freemarker template file(s) to be used
	                                to format the output
	 -r (--root) VAL              : root
	                                if a  root is specified the paths in the
	                                sourceFileList and in the output will be
	                                considered relative to this root path
	 -s (--silent)                : stay silent
	                                do not create any output on System.out if this
	                                switch is used
	 -t (--templateName) VAL      : name of Freemarker template to be used
	 -v (--version)               : showVersion
	                                show current version if this switch is used
	 -x (--extract)               : extract text
	                                extract text content to files
	 -w (--searchKeyWordList) VAL : file with search words
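
The interplay of --sourceFileList and --root described above can be illustrated with a small sketch. This is not pdfindexer's actual implementation: the class SourceListSketch and its methods are hypothetical, and it assumes that URL entries are used unchanged while all other entries are resolved against the root path. The file name pdfs/LoremIpsum.pdf in the demo is made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of how a source file list could be interpreted relative to a root. */
class SourceListSketch {

  /** Returns true for entries that look like http(s) URLs rather than file system paths. */
  static boolean isUrl(String entry) {
    return entry.startsWith("http://") || entry.startsWith("https://");
  }

  /** Skips blank lines, keeps URLs unchanged and resolves all other entries against root. */
  static List<String> resolve(List<String> lines, String root) {
    List<String> result = new ArrayList<>();
    for (String line : lines) {
      String entry = line.trim();
      if (entry.isEmpty()) {
        continue;
      }
      if (isUrl(entry)) {
        result.add(entry);
      } else {
        result.add(Paths.get(root).resolve(entry).normalize().toString());
      }
    }
    return result;
  }

  /** Reads a list file with one URL, directory or file name per line and resolves it. */
  static List<String> readAndResolve(Path listFile, String root) throws IOException {
    return resolve(Files.readAllLines(listFile, StandardCharsets.UTF_8), root);
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "pdfs/LoremIpsum.pdf", // made-up relative path
        "https://eprints.nottingham.ac.uk/249/1/cajun.pdf");
    for (String source : resolve(lines, "test/")) {
      System.out.println(source);
    }
  }
}
```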

Modifying the template

	 src/main/resources/templates

contains the default Freemarker template "defaultindex.ftl". You might want to modify it or create your own template and select it with the -t/--templateName option. The -p/--templatePath option sets the path where the template files are looked up.

Version history

Version	Date	Changes
0.0.3	2013	first published version
0.0.4	2013	adds text extract feature
0.0.5	2014-05-31	fixes template, fixes this README, allows positional command line arguments
0.0.6	2014-08-18	fixes bug, adds Apache License to README, adds GitHub as Maven repository
0.0.7	2015-02-13	upgrades to Apache PDFBox 1.8.4 to avoid bug https://issues.apache.org/jira/browse/PDFBOX-1541
0.0.8	2015-02-14	switches to NonSeq parser and upgrades to Apache PDFBox 1.8.8 to avoid bugs https://issues.apache.org/jira/browse/PDFBOX-1845 and https://issues.apache.org/jira/browse/PDFBOX-2523
0.0.9	2015-02-14	upgrades to Apache PDFBox 1.8.9 for good measure to avoid https://issues.apache.org/jira/browse/PDFBOX-2579
0.0.10	2017-04-28	upgrades PDFBox to 1.8.13
0.0.11	2018-08-22	fixes #4, fixes #5, upgrades Java to Java 8, uses com.bitplan.pom
