# LSTC

This is an approach to classifying Wikipedia documents into one of 452,167 categories without using the category hierarchy. The collection contains over 2 million documents, each represented by a number of meaningful terms.

## Data

The input data files `test.csv` and `train.csv` represent the document sets in the vector space model and follow the format described on the Kaggle page. Downloading them requires logging in to Kaggle.
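
Purely as an illustration, here is a hypothetical parser for a single training line, assuming the LSHTC-style layout in which comma-separated category labels are followed by space-separated `term:frequency` pairs; every name and the sample line are assumptions, not taken from this repository:

```cpp
// Hypothetical parser for one line of train.csv, assuming an LSHTC-style
// layout: comma-separated category labels, then "term:frequency" pairs.
// All names here are illustrative, not the repository's actual code.
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct Instance {
    std::vector<int> categories;                  // known category labels
    std::vector<std::pair<int, int> > term_freqs; // (term id, occurrences)
};

Instance parse_line(const std::string& line) {
    Instance inst;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        std::string::size_type colon = token.find(':');
        if (colon == std::string::npos) {
            // No colon: the token is a category label, possibly comma-terminated.
            if (!token.empty() && token[token.size() - 1] == ',')
                token.erase(token.size() - 1);
            inst.categories.push_back(std::stoi(token));
        } else {
            inst.term_freqs.push_back(std::make_pair(
                std::stoi(token.substr(0, colon)),
                std::stoi(token.substr(colon + 1))));
        }
    }
    return inst;
}

int main() {
    Instance i = parse_line("314523, 165538 1250536:1 634175:3 805104:2");
    std::cout << i.categories.size() << " categories, "
              << i.term_freqs.size() << " terms\n"; // "2 categories, 3 terms"
}
```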

## Algorithm

The approach implements the k-NN classification algorithm. Neighbours of each document from `test.csv` are searched for in a subset of training instances, namely those that contain the most meaningful terms of the processed document.

- Term weight is estimated by the TF-IDF statistic: let $n_{t,d}$ denote the number of occurrences of term $t$ in document $d$ from collection $D$; then

  $$\mathrm{tfidf}(t,d,D) = \frac{n_{t,d}}{\sum_{t' \in d} n_{t',d}} \cdot \log\frac{|D|}{|\{d' \in D : t \in d'\}|}$$

- The metric function of the k-NN algorithm is the Jaccard distance: for the term sets $A$ and $B$ of two documents,

  $$d_J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$

- Predicted categories are the most frequent categories among the obtained neighbour instances (see the sketch after this list).
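
Below is a minimal sketch of this prediction loop. It assumes the candidate subset is gathered through an inverted index from terms to the training documents containing them; all types, names, and signatures are hypothetical rather than the repository's actual API:

```cpp
// Minimal k-NN prediction sketch for one test document. Hypothetical names
// throughout; a sketch of the described approach, not the repository's code.
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

struct Doc {
    std::set<int> terms;         // term ids present in the document
    std::vector<int> categories; // known categories (training docs only)
};

// Jaccard distance between term sets: 1 - |A ∩ B| / |A ∪ B|.
double jaccard_distance(const std::set<int>& a, const std::set<int>& b) {
    std::size_t inter = 0;
    for (int t : a)
        inter += b.count(t);
    std::size_t uni = a.size() + b.size() - inter;
    return uni == 0 ? 0.0 : 1.0 - static_cast<double>(inter) / uni;
}

// Predict up to `cats` categories for `query` by majority vote over the
// `nghbrs` nearest candidates; candidates are the training documents that
// contain at least one of the query's top TF-IDF terms.
std::vector<int> predict(
        const Doc& query,
        const std::vector<int>& top_terms, // query's highest-TF-IDF term ids
        const std::vector<Doc>& train,
        const std::unordered_map<int, std::vector<std::size_t> >& inverted,
        std::size_t nghbrs, std::size_t cats) {
    // 1. Candidate subset: union of the posting lists of the top terms.
    std::set<std::size_t> candidates;
    for (int t : top_terms) {
        auto it = inverted.find(t);
        if (it != inverted.end())
            candidates.insert(it->second.begin(), it->second.end());
    }
    // 2. Rank candidates by Jaccard distance and keep the k nearest.
    std::vector<std::pair<double, std::size_t> > ranked;
    for (std::size_t id : candidates)
        ranked.push_back(std::make_pair(
            jaccard_distance(query.terms, train[id].terms), id));
    std::size_t k = std::min(nghbrs, ranked.size());
    std::partial_sort(ranked.begin(), ranked.begin() + k, ranked.end());
    // 3. The most frequent categories among the neighbours are predicted.
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k; ++i)
        for (int c : train[ranked[i].second].categories)
            ++votes[c];
    std::vector<std::pair<int, int> > tally(votes.begin(), votes.end());
    std::sort(tally.begin(), tally.end(),
              [](const std::pair<int, int>& x, const std::pair<int, int>& y) {
                  return x.second > y.second;
              });
    std::vector<int> predicted;
    for (std::size_t i = 0; i < tally.size() && i < cats; ++i)
        predicted.push_back(tally[i].first);
    return predicted;
}
```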

## Compilation and running

- Requirements:

  - GCC 4.8 or higher;
  - 3+ GB of RAM.

- Put `test.csv` and `train.csv` into the `data` folder.

- Run `make` to compile the `lstc` binary.

- Run `lstc` with

  ```
  ./lstc [cats <..>] [nghbrs <..>] [tfidfs <..>] [threads <..>]
  ```

  Optional command-line arguments are:

  - `cats` - the maximum number of categories to predict; 8 by default. The true number varies per document, and the approach does not include heuristics for predicting it;
  - `nghbrs` - the maximum number of nearest neighbours; 5 by default;
  - `tfidfs` - the number of most important terms to search for in the processed document, i.e. those with the largest TF-IDF weights; 4 by default;
  - `threads` - the number of OpenMP threads.
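
  For example, keeping the default prediction limits but running on four OpenMP threads (the argument values below are illustrative):

  ```
  ./lstc cats 8 nghbrs 5 tfidfs 4 threads 4
  ```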

The result file `output.csv` will be placed into the `output` folder.