An XQuery 3.0 module (Java bindings) that exposes Apache Tika's parsing capabilities to XQuery. Tika currently supports over 1,000 file types, including popular office formats.
Use xqpm to install the module for you:
xqpm xq-tika
To install manually instead:
- Download the latest version of the Tika-app.jar file.
- Add the file to your classpath or, if you are using BaseX, simply add the file to the BaseX\lib folder.
  Note for Windows: when launching the BaseX GUI, use the batch files located in the BaseX\Bin folder rather than the GUI executable. The batch files ensure that all JAR files in the lib folder are added to the classpath.
- Clone this repository to your local machine and import the xq-tika.xqm module into your project.
The xq-tika module currently exposes two core functions: parse and parse-lines.
When either function is called, Tika automatically detects the file type and returns the file's text content.
parse($path as xs:string) as xs:string
parse-lines($path as xs:string) as xs:string*
To support large files and reduce the memory footprint, you can specify a maximum string length; the document is then parsed only up to that length.
parse($filePath as xs:string, $maxStringLength as xs:string) as xs:string
parse-lines($filePath as xs:string, $maxStringLength as xs:string) as xs:string*
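As a rough illustration of the two-argument form, the sketch below parses only the start of a large file and reports how much text came back. The file path and the limit of 10000 are hypothetical, and the limit is passed as a string only because that is the type given in the signature above. Whether the limit counts characters exactly is an assumption, not something this README confirms.

```xquery
import module namespace tika = "https://xq-tika";

(: Hypothetical path and limit; the limit is passed as xs:string
   because that is the type shown in the documented signature. :)
let $preview := tika:parse('c:\large-report.pdf', '10000')
(: Report how much text was actually returned. :)
return string-length($preview)
```

If the limit behaves as described, the returned length should not exceed the value you passed in.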
import module namespace tika = "https://xq-tika";
tika:parse('c:\my-word-document.doc')
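Along the same lines, parse-lines can be used when you want to work with the text line by line. The following is a minimal sketch, assuming the same hypothetical file path as above; it drops blank lines and prefixes each remaining line with its position.

```xquery
import module namespace tika = "https://xq-tika";

(: Hypothetical path; replace it with a real document on your machine. :)
let $lines := tika:parse-lines('c:\my-word-document.doc')
(: Skip lines that are empty or whitespace-only, then number the rest. :)
for $line at $i in $lines[normalize-space(.) ne '']
return $i || ': ' || $line
```

Because parse-lines returns a sequence (xs:string*), the result can be filtered and iterated with ordinary XQuery expressions, with no need to split the text yourself.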
If you like what you see here, please star the repo and follow me on GitHub or LinkedIn.
Happy Parsing!