Yocto Search Engine

A tiny, yet functional, search engine.

Introduction

This is an attempt to implement the basic components of a small yet functional free-text search scenario on the English Wikipedia XML page dump, from corpus manipulation, indexing and persistence to basic retrieval through a simple command line interface. It was intended as a way to become familiar with the basic notions and data structures of information retrieval. Even though it does not pretend to be a fully functional implementation of such a system, the size of the corpus (approximately 45 GB) posed some interesting memory usage and I/O blocking challenges that were fun to tackle and worth the effort :) I hope this is helpful for anyone interested in understanding more about searching text corpora.

Latest Status (16/04/2013)

Takes Wikipedia database dump files and builds an index on the author and main text body. With a command line interface you can issue search queries for single words or word prefixes, which are ultimately resolved into a list of titles of the articles that contain the given term. It was an implementation choice, at least for this first version, to have no dependencies on third-party libraries.
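To make the idea of resolving a word or word prefix into article titles more concrete, here is a minimal sketch of an inverted index with prefix lookup. It is not the project's actual code: the class name `PrefixIndexSketch`, its methods, the tokenization rule and the lower-casing are all assumptions made for illustration, and everything lives in memory, which the real 45 GB corpus would not allow.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of a term -> titles index; not the project's real classes. */
class PrefixIndexSketch {

    // Sorted term dictionary: each term maps to the titles of the articles containing it.
    private final NavigableMap<String, Set<String>> postings = new TreeMap<>();

    /** Indexes every word of the text under the given article title. */
    void add(String title, String text) {
        // Assumption: a token is a run of letters or digits, compared case-insensitively.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(title);
            }
        }
    }

    /** Exact single-term lookup. */
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    /** Prefix lookup: the union of the postings of all terms starting with the prefix. */
    Set<String> searchPrefix(String prefix) {
        String from = prefix.toLowerCase(Locale.ROOT);
        String to = from + Character.MAX_VALUE; // upper bound of the prefix range
        Set<String> titles = new TreeSet<>();
        for (Set<String> hits : postings.subMap(from, true, to, false).values()) {
            titles.addAll(hits);
        }
        return titles;
    }
}
```

Keeping the term dictionary sorted is what makes the prefix query cheap: all terms sharing a prefix sit next to each other, so a single range scan collects them. The sketch relies only on the JDK, in line with the no-third-party-libraries choice.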

Quick-Start Guide

On your mark

First clone the repository in a directory of your choice:

# git clone git@github.com:vmous/jYoctoSearchEngine.git

Get set

Change into the cloned directory and use Maven to assemble an executable .jar file:

# cd jYoctoSearchEngine
# mvn package
...

The .jar file is created in the target directory.

Go!

To run the indexer issue

# java -Xms1g -Xmx2g -cp jYoctoSearchEngine-1.0.jar yocto.cli.Index <path_to_XML_dump>

This will start parsing and indexing the XML dump file you provided and put everything in a directory called index beside the .jar file.

Now, go buy yourself a coffee :-) And don't choose Espresso, it'll take at least a couple of hours :-)
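For a feel of what the parsing step involves, the sketch below streams pages out of a dump using the JDK's built-in StAX API, handing each page to a callback instead of loading the whole file. It is only an illustration under assumptions: `DumpStreamSketch`, `Page` and `readPages` are hypothetical names, the element names (`page`, `title`, `username`, `text`) follow the usual MediaWiki export layout, and the project's own parser may work quite differently.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch of streaming a Wikipedia XML dump; not the project's parser. */
class DumpStreamSketch {

    /** One parsed article: title, contributor user name and wiki text. */
    static final class Page {
        final String title;
        final String author;
        final String text;

        Page(String title, String author, String text) {
            this.title = title;
            this.author = author;
            this.text = text;
        }
    }

    /** Reads the dump sequentially and passes each completed page to the sink. */
    static void readPages(Reader xml, Consumer<Page> sink) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            String title = null, author = null, text = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    switch (reader.getLocalName()) {
                        case "title":    title = reader.getElementText();  break;
                        case "username": author = reader.getElementText(); break;
                        case "text":     text = reader.getElementText();   break;
                        default: break; // other elements are skipped
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "page".equals(reader.getLocalName())) {
                    sink.accept(new Page(title, author, text));
                    title = author = text = null; // reset for the next page
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed dump", e);
        }
    }
}
```

Because only one page is held at a time, memory use stays roughly constant regardless of dump size; the indexing cost then lies in what the sink does with each page.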

When done you can run the searcher

# java -Xms1g -Xmx2g -cp jYoctoSearchEngine-1.0.jar yocto.cli.Search <index_directory>

Here, <index_directory> is the directory created by the indexer in the previous step.

After it initializes you end up with a prompt like this

Yocto Search #

Issue your term queries

Yocto Search # google

You can also search authors/contributors

Yocto Search # author:prosody

You can even make prefix queries on both terms and authors, e.g.

Yocto Search # author:prosopon
...
Yocto Search # author:prosody
...
Yocto Search # author:proso*
... + ...
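The query forms shown above (a plain term, an author: query, and a trailing * for prefixes) could be told apart with a small parser along these lines. This is a sketch, not the project's CLI code: `QuerySketch`, the default field name `text` and the lower-casing of terms are assumptions made for illustration.

```java
import java.util.Locale;

/** Hypothetical sketch of parsing one line typed at the search prompt. */
class QuerySketch {

    final String field;   // "author", or "text" (assumed name for the article body field)
    final String term;    // the word or prefix to look up, lower-cased
    final boolean prefix; // true when the query ended with '*'

    QuerySketch(String field, String term, boolean prefix) {
        this.field = field;
        this.term = term;
        this.prefix = prefix;
    }

    /** Splits input such as "google", "author:prosody" or "author:proso*" into its parts. */
    static QuerySketch parse(String input) {
        String q = input.trim();
        String field = "text";
        if (q.startsWith("author:")) {
            field = "author";
            q = q.substring("author:".length());
        }
        boolean prefix = q.endsWith("*");
        if (prefix) {
            q = q.substring(0, q.length() - 1); // drop the trailing wildcard
        }
        return new QuerySketch(field, q.toLowerCase(Locale.ROOT), prefix);
    }
}
```

With this shape, `author:proso*` becomes a prefix lookup for `proso` in the author index, which would return the union of the results for `author:prosopon` and `author:prosody`.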

At any point you can type q! to exit the search CLI. Don't forget to drop a comment or even a GitHub star on your way out :)

Tech Report

Check here for some insights into the current YoctoSearchEngine implementation and future work.

Build Status

Travis CI:

Licence

Yocto Search Engine is distributed under the Apache 2 license.

