# Information-Retrieval-System

Student project Autumn 2014 - ETH Zurich

The project consists of two main parts, described in turn in this file:

  • Query engine
  • Document classifier

# Query Engine

## Usage

Usage:
 <root path> <use language model> <only last 10 queries> <document number> <queries> (optional)
 <root path> : String, path to the Tipster data folder
 <use language model> : Boolean, true = language-based model / false = term-based model
 <only last 10 queries> : Boolean, true = only use the last test queries / false = use all the provided topic queries
 <document number> : Int, number of documents in the stream to process, -1 for all
 <queries> : String, optional custom queries given as "query1" "query2" ... "queryN" / if not given, the topic queries are used as input
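For example, a hypothetical invocation (argument values are illustrative only) that scores every document with the term-based model against two custom queries:

```
/path/to/tipster false true -1 "space program" "airbus subsidies"
```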

## Architecture

com.tonybeltramelli.desktop.
    core.
        perf.
            PerformanceAssessor.scala : wrapper used to assess precision, recall and mean average precision
            PrecisionRecallAssessor.scala : extends ch.ethz.dal.tinyir.lectures.PrecisionRecall, used by PerformanceAssessor to assess performance
        scoring.
            AScoring.scala : trait used to encapsulate the scoring logic
            LanguageBasedScoring.scala : extends AScoring, uses maximum likelihood estimation with Jelinek-Mercer smoothing for scoring
            TermBasedScoring.scala : extends AScoring, uses a square root linear transformation of term frequencies
    util.
        Helper.scala : contains constant values and utility functions
    Main.scala : entry point, algorithm flow and topic loading
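The scoring traits follow a two-phase pattern described in the Miscellaneous section below: corpus statistics are gathered once, and individual query/document scores are computed on demand. A minimal sketch of that shape, with hypothetical signatures (the real AScoring trait may differ):

```scala
// Sketch only: illustrates the feed/getScore split of AScoring; signatures are assumptions.
trait AScoringSketch {
  // accumulated once over the whole document stream (phase 1)
  protected val collectionFrequencies = scala.collection.mutable.Map.empty[String, Int]
  protected var collectionFrequencySum = 0L

  // phase 1: register the tokens of one document and update the collection statistics
  def feed(docName: String, tokens: Seq[String]): Unit

  // phase 2: score one document against a query, using the statistics gathered by feed
  def getScore(docName: String, query: Seq[String]): Double
}
```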

## Miscellaneous

Optimizations:

  • Stemming uses a cache system to improve performance, at the cost of some extra space consumption
  • Collection frequencies, the collection frequency sum, term frequencies and the term frequency sums are computed for all documents first (method "feed" in AScoring.scala); the score for a specific query is then computed on demand (overridden method "getScore" in LanguageBasedScoring.scala and TermBasedScoring.scala)
  • For term frequency weighting, a logarithmic transformation and augmented term frequency were also tried (see TermBasedScoring.scala), but the square root linear transformation was preferred for performance / precision reasons
  • For Jelinek-Mercer smoothing, a lambda value of 0.1 was chosen following Chengxiang Zhai and John Lafferty, "A Study of Smoothing Methods for Language Models Applied to Information Retrieval", ACM Transactions on Information Systems (TOIS), 2004 (see the sketch below)
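A rough sketch of the two scoring rules mentioned above, with illustrative names (not the actual fields or signatures of LanguageBasedScoring / TermBasedScoring):

```scala
object ScoringSketch {
  // smoothing weight, following Zhai & Lafferty (2004)
  val lambda = 0.1

  // Language-based model: log P(q | d) with Jelinek-Mercer smoothing,
  // log P(q | d) = sum_t log((1 - lambda) * tf(t, d) / |d| + lambda * cf(t) / |C|)
  def languageModelScore(query: Seq[String],
                         termFreq: Map[String, Int], docLength: Int,
                         collFreq: Map[String, Int], collLength: Long): Double =
    query.filter(collFreq.contains).map { t =>
      val pDoc = termFreq.getOrElse(t, 0).toDouble / docLength
      val pColl = collFreq(t).toDouble / collLength
      math.log((1 - lambda) * pDoc + lambda * pColl)
    }.sum

  // Term-based model: square root transformation of the raw term frequencies
  def termModelScore(query: Seq[String], termFreq: Map[String, Int]): Double =
    query.map(t => math.sqrt(termFreq.getOrElse(t, 0).toDouble)).sum
}
```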

Run with VM arguments:
-Xss200m
-Xmx2048m

## Dependencies

Implemented and tested for Scala 2.11.2, JavaSE-1.7, jvm-1.6

Library TinyIR.jar:

ch.ethz.dal.tinyir.
            processing.
                StringDocument.scala
                Document.scala
                Tokenizer.scala
                XMLDocument.scala
                SaxParsing.scala
                StopWords.scala
                TipsterParse.scala
            io.
                ZipDirStream.scala
                DirStream.scala
                ParsedXMLStream.scala
                TipsterStream.scala
                ZipStream.scala
                DocStream.scala
            lectures.
                PrecisionRecall.scala
                TipsterGroundTruth.scala
            alerts.
                AlertsTipster.scala
                Alerts.scala
                Query.scala
            util.
                StopWatch.scala
            indexing.
                FreqIndex.scala
                SimpleIndex.scala
                InvertedIndex.scala

Stemmer class:

com.github.aztek.porterstemmer.PorterStemmer.scala
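The stemming cache mentioned in the optimization sections can be pictured as a simple memoization wrapper around the Porter stemmer. A sketch, assuming PorterStemmer exposes a stem(String): String method (this is not the project's actual Helper / StopWordStemmer code):

```scala
import com.github.aztek.porterstemmer.PorterStemmer
import scala.collection.mutable

// Sketch of the stemming cache: each distinct token is stemmed only once,
// trading some memory for speed.
object CachedStemmer {
  private val cache = mutable.Map.empty[String, String]

  def stem(token: String): String =
    cache.getOrElseUpdate(token, PorterStemmer.stem(token))
}
```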

# Document Classifier

Implemented document classifiers:

  • Naive Bayes
  • Logistic Regression
  • Support Vector Machines

## Usage

Usage: 
 <root path> <classifier> <document number>
 <root path> : String, path to the root folder
 <classifier> : Int, 1 => Naive Bayes / 2 => Logistic Regression / 3 => Support Vector Machines
 <document number> : Int, number of documents to process, -1 for all
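For example, a hypothetical invocation (path and values are illustrative only) training the Logistic Regression classifier on the first 10000 documents:

```
/path/to/data 2 10000
```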

## Classifiers

Package com.tonybeltramelli.desktop.core.classifier

  • binary.ABinaryLinearClassifier : trait encapsulating the behavior common to both binary linear classifiers (Logistic Regression and Support Vector Machines): the training method, vector operations and the sigmoid function.
  • binary.LogisticRegressionBinary : extends ABinaryLinearClassifier and overrides the gradient method.
  • binary.SupportVectorMachinesBinary : extends ABinaryLinearClassifier and overrides the gradient method with the Pegasos algorithm (a sketch of both update rules follows this list).
  • multiclass.AClassifier : trait encapsulating data preprocessing (tf-idf computation, mapping of classes to documents), the interface to the main class (train and apply methods), and result sorting and pruning.
  • multiclass.AMultinomialLinearClassifier : extends AClassifier and adds multiclass classification support for binary linear classifiers implementing ABinaryLinearClassifier. The document features are the term frequency-inverse document frequency (tf-idf) values of the most important terms of each document; see the Optimizations section below for how these terms are selected.
  • LogisticRegression : extends AMultinomialLinearClassifier and passes LogisticRegressionBinary instances to its superclass.
  • NaiveBayes : extends AClassifier directly because the implementation natively supports multiclass classification (inspired by the multinomial Naive Bayes pseudocode in Manning et al., "Introduction to Information Retrieval"). The implementation uses term frequencies and Laplace smoothing to improve classification.
  • SupportVectorMachines : extends AMultinomialLinearClassifier and passes SupportVectorMachinesBinary instances to its superclass.
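The two binary classifiers share the same training loop and differ only in their update rule. A rough sketch of both rules on sparse Int-keyed feature vectors (illustrative names and signatures, not the project's actual code):

```scala
object BinaryUpdateSketch {
  type SparseVec = Map[Int, Double]

  def dot(w: SparseVec, x: SparseVec): Double =
    x.map { case (i, v) => w.getOrElse(i, 0.0) * v }.sum

  // w + a * x on sparse vectors
  def axpy(a: Double, x: SparseVec, w: SparseVec): SparseVec =
    x.foldLeft(w) { case (acc, (i, v)) => acc.updated(i, acc.getOrElse(i, 0.0) + a * v) }

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Logistic regression: one SGD step on the log-loss, label y in {0, 1}
  def logisticStep(w: SparseVec, x: SparseVec, y: Double, eta: Double): SparseVec =
    axpy(-eta * (sigmoid(dot(w, x)) - y), x, w)

  // Pegasos (SVM): step t with regularization weight lambda, label y in {-1, +1}
  def pegasosStep(w: SparseVec, x: SparseVec, y: Double, lambda: Double, t: Int): SparseVec = {
    val eta = 1.0 / (lambda * t)
    val shrunk = w.map { case (i, v) => i -> ((1.0 - eta * lambda) * v) }
    if (y * dot(w, x) < 1.0) axpy(eta * y, x, shrunk) else shrunk
  }
}
```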

## Optimizations

  • Stemming uses a cache system to improve performance, at the cost of some extra space consumption; stop words are removed from each document before stemming to eliminate semantically non-selective words (StopWordStemmer).
  • Only the n most important terms of each document are kept to minimize memory usage, i.e. the terms with the highest term frequency (_getTermFreq in AClassifier).
  • Two linked mutable Map data structures, classesToDoc and documents in AClassifier, provide fast class-to-document mapping, fast tf-idf computation (_computeTermFreqInverseDocumentFreq) and fast retrieval of document features. Int keys are used instead of Strings to save space.
  • The binary classifiers are trained on a subset of the documents to attenuate the effect of imbalanced classes and to speed up training (getRandomDocuments in AMultinomialLinearClassifier); the documents are chosen as follows (see the sketch after this list):
    • All the positive documents of the classifier's class.
    • n randomly selected negative documents, with n = 3 x (number of positive documents for the class); a factor of 3 was found to give good results while still training the classifiers with a reasonable amount of negative documents.
  • The results are normalized and pruned to improve precision and F1 score (_getNormalizedAndPrunedResults in AClassifier).
  • To improve performance, the results are written to a file using a buffered writer (Printer class).
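A sketch of the negative-document selection described above (illustrative only; the actual getRandomDocuments implementation may differ):

```scala
import scala.util.Random

object SamplingSketch {
  // Keep every positive document of the class plus ratio times as many random negatives.
  def sampleTrainingSet[D](positives: Seq[D], negatives: Seq[D], ratio: Int = 3): Seq[D] =
    positives ++ Random.shuffle(negatives).take(ratio * positives.size)
}
```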
