jerrygao edited this page Jan 28, 2020 · 78 revisions

JATE2.0 is a modular, scalable, configurable and ready-to-use term extraction tool and development framework. It integrates with Apache Solr and provides a number of general-purpose plugins for text mining. Currently, JATE2.0 does not support multiple languages directly; instead, it provides a framework in which language-specific plugins can be developed and configured.

This page walks through several examples to get you started as quickly as possible. Two YouTube videos (Embedded Mode Demo and Plugin Mode Demo) are also available to help you quickly go through this tutorial.

Step 1 - Install JDK 1.8 or later

JATE2.0 is based on JDK 1.8. Please download and install Java 1.8 or later.

Step 2 - Install Apache Solr (Optional)

JATE2.0 is built against Apache Solr 7.2.1. Please download and install it. This step is optional: you only need an external Solr server if you want to run JATE inside it (as in Plugin mode); Embedded mode does not require it.

Step 3 - Get JATE

Next, you need to download the latest release of JATE2.0 via the Nexus Repository or Git.

Step 4 - Setup libraries (Optional)

JATE uses Maven to manage its libraries. If you are using JATE as a library in your project, the only library that is not available in the Maven Central repository is Dragontool. For this, either configure it manually in your IDE, or run mvn install to install it into your local Maven repository.

The version of dragontool used here (from the edu.drexel group) does not exist on Maven Central. However, the same version is available under the de.julielab group (credit to @catap) if you would like to configure JATE as an external library in your project.
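If you opt for the de.julielab artifact, the dependency declaration would look something like the sketch below. The artifactId and version shown here are assumptions, not confirmed coordinates; check the de.julielab group on Maven Central for the exact values:

```xml
<dependency>
    <groupId>de.julielab</groupId>
    <!-- artifactId and version are assumptions; verify on Maven Central -->
    <artifactId>dragontool</artifactId>
    <version>1.0.0</version>
</dependency>
```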

If you are using JATE as a library, the Maven dependency for JATE2.0 is as follows:

<dependency>
    <groupId>uk.ac.shef.dcs</groupId>
    <artifactId>jate</artifactId>
    <version>2.0-beta.11</version>
</dependency>

See Using-JATE for more details.

Step 5 - Set-up and configuration of Solr Core

JATE2.0 requires a Solr core instance directory for indexing documents and storing term statistics. The JATE2.0 testbed contains several example Solr core configurations for different corpora. Note: a document unique-id field is mandatory, and the field name MUST be 'id'.
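For orientation, a core instance directory follows the standard Solr layout. The sketch below is an assumption based on the testbed paths mentioned on this page (verify against the actual testdata/solr-testbed folder in your checkout):

```
<JATE_HOME>/testdata/solr-testbed/
└── ACLRDTEC/
    ├── conf/
    │   ├── solrconfig.xml
    │   ├── schema.xml
    │   └── bnc_unifrqs.normal   (reference corpus frequency file)
    └── core.properties          (standard Solr core marker; assumed present)
```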

Step 6 - Configure jate.properties in your class path (Optional)

jate.properties defines a number of properties you may need to set for JATE2.0 to work with your Solr instance. If you use the default Solr instance configuration, however, you do not need to provide this file; the default will be loaded automatically from the classpath.

Step 7 - Run JATE2.0 as a standalone application (Embedded mode)

Next, to run an ATE program over your corpus, you can simply launch the JATE standalone jar from the command line as shown below. Running JATE2.0 in embedded mode means that you do not have to install Apache Solr (i.e., you can skip Step 2):

java -cp jate-2.0-*-jar-with-dependencies.jar <APP_ALGORITHM> <OPTIONS> <SOLR_HOME_PATH> <SOLR_CORE_NAME>

For example, to analyse a given corpus with the CValue algorithm and the default Solr setting, using the example core 'ACLRDTEC', you can run the following from the command line:

java -cp <PATH>/jate-2.0-*-jar-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppCValue -corpusDir <CORPUS_DIR> -o cvalue-terms.json <JATE_HOME>/testdata/solr-testbed ACLRDTEC

Then, to rank and output weighted terms with a different algorithm (e.g., TF-IDF), you simply switch to the corresponding App class:

java -cp <PATH>/jate-2.0-*-jar-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppTFIDF -o tfidf-terms.json <JATE_HOME>/testdata/solr-testbed ACLRDTEC

The following algorithms can be run as standalone applications:

Algorithm APP_ALGORITHM
TTF uk.ac.shef.dcs.jate.app.AppTTF
ATTF uk.ac.shef.dcs.jate.app.AppATTF
TF-IDF uk.ac.shef.dcs.jate.app.AppTFIDF
RIDF uk.ac.shef.dcs.jate.app.AppRIDF
CValue uk.ac.shef.dcs.jate.app.AppCValue
ChiSquare uk.ac.shef.dcs.jate.app.AppChiSquare
RAKE uk.ac.shef.dcs.jate.app.AppRAKE
Weirdness uk.ac.shef.dcs.jate.app.AppWeirdness
GlossEx uk.ac.shef.dcs.jate.app.AppGlossEx
TermEx uk.ac.shef.dcs.jate.app.AppTermEx
Basic uk.ac.shef.dcs.jate.app.AppBasic
ComboBasic uk.ac.shef.dcs.jate.app.AppComboBasic

Run-time parameter options for the standalone application:

options Expected Type description
-corpusDir string The directory of the corpus that will be processed.
-prop string Path to a jate.properties file for the configuration of the Solr schema.
-c boolean Expects 'true' or 'false'. Specifies whether to collect term information for exporting, e.g., offsets in documents. Default is false. Setting it to true will significantly increase the post-processing time needed to query the Solr index for such information.
-r string Reference corpus frequency file (path) is required by AppGlossEx, AppTermEx and AppWeirdness. An example is provided in '/testdata/solr-testbed/ACLRDTEC/conf/bnc_unifrqs.normal'.
-cf.t number This is a post-filtering setting. Cutoff score threshold for selecting terms. If multiple -cf.* parameters are set the preference order will be cf.t, cf.k, cf.kp.
-cf.k number This is a post-filtering setting. Cutoff top ranked K terms to be selected. If multiple -cf.* parameters are set the preference order will be cf.t, cf.k, cf.kp.
-cf.kp number This is a post-filtering setting. Cutoff top ranked K% terms to be selected. If multiple -cf.* parameters are set the preference order will be cf.t, cf.k, cf.kp.
-pf.mttf number Pre-filter minimum total term frequency. Any candidate term whose total frequency in the corpus is less than this value will not be considered for ranking.
-pf.mtcf number Pre-filter minimum context frequency of a term (used by co-occurrence based methods). This is the number of context objects in which a term appears. Any candidate whose context frequency is lower than this value will not be considered for ranking.
-o string File (path) to save output. Only JSON output is supported now.
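Putting several of these options together, a sketch of a fuller invocation might look like the following; the paths are placeholders you would substitute, and the threshold values are illustrative only:

```sh
# Rank with TF-IDF, drop candidates occurring fewer than 2 times,
# keep only the top 500 ranked terms, and export them as JSON.
java -cp <PATH>/jate-2.0-*-jar-with-dependencies.jar \
     uk.ac.shef.dcs.jate.app.AppTFIDF \
     -corpusDir <CORPUS_DIR> \
     -pf.mttf 2 \
     -cf.k 500 \
     -o tfidf-top500.json \
     <JATE_HOME>/testdata/solr-testbed ACLRDTEC
```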

Step 8 - Run JATE2.0 as a Solr plugin in your own Solr server (Plugin mode)

You can also run JATE2.0 as a Solr plugin in your own Solr server (supported since the Beta version). The recommended setup for an individual Solr core is as follows:

Step 8.1 Add JATE2.0 jars to the Solr server

Create a new folder 'jate' under a lib or contrib directory in the instanceDir of your Solr core. Then place the JATE2.0 jars (simply use jate-2.0-*-jar-with-dependencies.jar) in the $SOLR_HOME/contrib/jate/lib folder.

Step 8.2 Configure JATE2.0 jars in the Solr core

Assume you have created a Solr core called 'jate'. This corresponds to the folder '$SOLR_HOME/solr/jate', whose content should be similar to the two sample cores provided (e.g., in the testbed). Next, configure the JATE jar path in '$SOLR_HOME/solr/jate/conf/solrconfig.xml'. An example setting is as follows:

<lib dir="${solr.install.dir:../../..}/contrib/jate/lib" regex=".*\.jar" />

You may also refer to how to add custom plugins in SolrCloud mode.

Step 8.3 Configure the term recognition request handler in the Solr core

Candidate extraction and term scoring are two separate processes: the former is performed automatically at index time, while the latter needs to be triggered separately. In JATE2.0, scoring can be triggered by an HTTP request. The term recognition request handler needs to be configured in your solrconfig.xml so that candidate terms (extracted at document indexing time) can be scored, ranked, filtered and exported. The final selected terms are saved in a field defined in your schema, by default called jate_domain_terms. This can be changed in jate.properties.

In addition to the run-time parameter options (listed above), the following parameters can be configured for the request handler:

options Expected Type Is Required description
algorithm string Y The ATE algorithm used to weight candidate terms. For accepted values, please refer to the algorithms listed above.
extraction boolean N Set true or false to determine whether candidate terms will be (re)extracted from the current index. Default is false. Essentially, this is a re-indexing process; for example, you can set it to true to try out different term PoS sequence patterns or pre-filtering settings. Note: remember to use RELOAD to apply configuration changes in Solr (see Configuration_Changes_In_Solr).
indexTerm boolean N Set true or false to determine whether filtered candidate terms will be indexed and stored (e.g., to support faceted navigation/search). This requires a corresponding Solr field to be configured in the schema if set to true. Default is false. Indexing filtered terms with boosting is only available in plugin mode in the current version.
boosting boolean N Set true or false to determine whether the term score will be used as the boosting value when indexing filtered terms. Default is false. You will need to enable norms (set 'omitNorms' to 'false') on the jate_domain_terms field of your schema before setting boosts. Enabling boosting requires more memory. Warning: this only works in the Beta.1 version (which supports Solr 5.x), because index-time boosts are no longer supported since Solr 6.5.

An example setting is as follows:

 <requestHandler name="/termRecogniser" class="uk.ac.shef.dcs.jate.solr.TermRecognitionRequestHandler">
    <lst name="defaults">
       <str name="algorithm">CValue</str>
       <bool name="extraction">false</bool>
       <bool name="indexTerm">true</bool>
       <bool name="boosting">false</bool>
       <str name="-prop"><YOUR_PATH>/resource/jate.properties</str>
       <float name="-cf.t">0</float>
       <str name="-o"><YOUR_PATH>/industry_terms.json</str>
     </lst>
 </requestHandler>

Step 8.4 Configure JATE Solr fields in schema.xml

To make JATE2.0 work with your indexing engine, you need to configure your schema.xml properly. Two Solr content analysis fields (an n-gram field and a term candidate field) are mandatory. Please refer to jate_text_2_ngrams and jate_text_2_terms in the schema.xml of the sample instance cores for examples.

To enable JATE2.0 to work with the Tika plugin, you need to make sure that (1) your Tika requestHandler in solrconfig.xml defines fmap.content to map to the text field defined in your schema.xml; (2) your schema.xml copies the text field into the two fields required by JATE2.0; (3) your text field is set to indexed="true" stored="true"; and (4) in terms of boosting, your setting for the jate_domain_terms field in your schema is compatible with your setting for the requestHandler in your solrconfig.

An example setting is shown below:

----------- solrconfig.xml --------------------
<!-- Solr Cell Update Request Handler in solrconfig.xml
https://wiki.apache.org/solr/ExtractingRequestHandler
    -->
<requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
   <lst name="defaults">
     ...
     <str name="fmap.content">text</str>
     <bool name="boosting">false</bool>
     ... 
   </lst>
   ...
</requestHandler>
----------- schema.xml-----------
...
<types>
  ...
  <fieldType name="jate_text_2_ngrams" class="solr.TextField" positionIncrementGap="100"> ... </fieldType>
  <fieldType name="jate_text_2_terms" class="solr.TextField" positionIncrementGap="100"> ... </fieldType>
  ...
</types>
<fields>
  ...
  <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
  ...
  <!-- Field to index text with n-gram tokens-->
  <field name="jate_ngraminfo" type="jate_text_2_ngrams" indexed="true" stored="false" multiValued="false"
               termVectors="true" termPositions="true" termOffsets="true"
               termPayloads="true"/>
  <!-- Field to index text with candidate terms. --> 
  <field name="jate_cterms" type="jate_text_2_terms" indexed="true" stored="false" multiValued="false"
               termVectors="true"/>
  <field name="jate_domain_terms" type="string" indexed="true" stored="true" omitNorms="true" required="false" multiValued="true"/>
  <copyField source="text" dest="jate_cterms" />	
  <copyField source="text" dest="jate_ngraminfo" />	
</fields>

Step 8.5 Upload documents (Optional)

Now you are able to upload documents with the Tika plugin. Term candidates will be extracted and indexed at document indexing time.

For example, you can use JATE2.0 to analyse your local archive by directly uploading documents with the Solr post tool:

#!/bin/sh
<SOLR_HOME>/bin/post -c <CORE_NAME> -host <HOST_NAME> -p <PORT_NO> <CORPUS_DIR>

Alternatively, you can change the 'algorithm' setting in the request handler to analyse your content with a different ATE algorithm.

Step 8.6 Term scoring, ranking, filtering, indexing, storing and exporting triggered by an HTTP request

Please note that in plugin mode the term scoring process is separated from candidate extraction (which happens at index time), and the scoring and filtering process can be triggered by sending an HTTP request to Solr. Candidate extraction can also be enabled as an option (by setting 'extraction' to true in the request handler config).

For example, with the setting above, sending a POST request will export the final ranked and filtered terms into a JSON file for further analysis:

$ curl -X POST http://localhost:8983/solr/jateCore/termRecogniser
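Solr request handlers generally allow the values in the 'defaults' list to be overridden by request parameters. Assuming the JATE handler follows this standard Solr convention (an assumption; verify against your JATE version), you could re-score with a different algorithm and cutoff without editing solrconfig.xml:

```sh
# Hypothetical override of the configured defaults via URL parameters;
# 'algorithm' and '-cf.k' mirror the request handler options above.
curl -X POST "http://localhost:8983/solr/jateCore/termRecogniser?algorithm=TFIDF&-cf.k=100"
```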

Live demo of JATE2.0 for knowledge retrieval

For the live demo at the LREC 2016 conference, please refer to jateSolrPluginDemo for more details.