Implementation of Information Retrieval and Text Mining algorithms including:
- Indexers:
  - Inverted
  - KGram
- Boolean retrieval
- WildCard retrieval
- Distance calculation
- Ranking-based retrieval (cosine similarity and tf-idf)
- Perceptron classification
- Multiple confusion matrix stats
- KMeans clustering, with RSS-based optimization
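
As an illustration of the first two items in the list above, here is a minimal sketch of an inverted index with Boolean AND retrieval. It is not this repository's code: the names `build_inverted_index`, `intersect`, and `boolean_and` are hypothetical, and whitespace tokenization with lowercasing is an assumption made to keep the example short.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: tokens are whitespace-separated and case-insensitive.
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def boolean_and(index, *terms):
    """Return ids of documents containing every query term."""
    # Intersect the shortest postings lists first to keep intermediate results small.
    postings = sorted((index.get(t.lower(), []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result


docs = [
    "new home sales top forecasts",
    "home sales rise in july",
    "increase in home sales in july",
]
index = build_inverted_index(docs)
matches = boolean_and(index, "home", "july")
```

With these three toy documents, `matches` holds the ids of the second and third documents, since only those contain both query terms.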

- The tests are run using xmlrunner (following the unittest style).
- The documentation style is NumPy/SciPy Docstrings.
- Extensive debugging logging.debug() calls are commented.
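
To make the conventions above concrete, here is a small sketch that combines a NumPy-style docstring, a commented `logging.debug()` call, and a unittest-style test case. The function `edit_distance` and the class `EditDistanceTest` are hypothetical and not taken from this repository. Levenshtein distance is used only as an assumed example of a distance calculation.

```python
import unittest


def edit_distance(source, target):
    """Compute the Levenshtein edit distance between two strings.

    Parameters
    ----------
    source : str
        String to transform.
    target : str
        String to transform `source` into.

    Returns
    -------
    int
        Minimum number of single-character insertions, deletions,
        and substitutions needed.
    """
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        # logging.debug("row %d: %s", i, current)
        previous = current
    return previous[-1]


class EditDistanceTest(unittest.TestCase):
    """Hypothetical test case written in the unittest style."""

    def test_known_pair(self):
        self.assertEqual(edit_distance("kitten", "sitting"), 3)

    def test_empty_source(self):
        self.assertEqual(edit_distance("", "abc"), 3)
```

Saved as a module, this can be run with the standard library runner via `python -m unittest`. This sketch deliberately sticks to the standard library rather than the xmlrunner package mentioned above.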