maurezen/indexer

What

A basic (text) file indexer library in Kotlin. Given a character sequence, an indexer finds all occurrences of that sequence as a substring in a given set of files. The primary use case is indexing a moderately sized codebase for subsequent searches.

How to build

gradle(w) clean shadowJar kotlinSourcesJar

Place indexer-$version-all.jar somewhere on your classpath. Point your IDE towards indexer-$version-sources.jar if needed.

How to use

Point an indexer towards the directory you need indexed

val index = IndexBuilderCoroutines()
    .with(dirName)
    .buildAsync().await()

shoot your queries at it after it's done

// this gets you just a set of filenames that contain this query string
// on the plus side, it doesn't have to read the files for that
val entry = index.query("foobar")

to get more details

// this gets you lines and line positions for each file
// on the flip side, index has to re-read the files for that
val richEntry = index.queryAndScan("lorem ipsum")
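
To illustrate what "lines and line positions" could look like, here is a minimal sketch that scans a single text for every occurrence of a query. The `Occurrence` type and the `scan` function are hypothetical names made up for this example; they are not the library's API, and the actual shape of the `queryAndScan` result may differ.

```kotlin
// Hypothetical illustration only: not the library's types or implementation.
data class Occurrence(val line: Int, val column: Int)

// Finds every (possibly overlapping) occurrence of `query` in `text`.
// Lines are numbered from 1, columns are 0-based offsets within the line.
// Matches that span a line break are not reported by this sketch.
fun scan(text: String, query: String): List<Occurrence> {
    if (query.isEmpty()) return emptyList()
    val result = mutableListOf<Occurrence>()
    text.lineSequence().forEachIndexed { index, line ->
        var from = line.indexOf(query)
        while (from >= 0) {
            result += Occurrence(index + 1, from)
            from = line.indexOf(query, from + 1)
        }
    }
    return result
}

fun main() {
    val text = "foobar\nno match here\nbar foobar foobar"
    // Reports one hit on line 1 and two hits on line 3.
    println(scan(text, "foobar"))
}
```

A scan like this has to read the text itself, which is why the detailed query is the more expensive of the two.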

to update the index

val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)

var index = indexBuilder.buildAsync().await()

// something something something

runBlocking {
    //update is a suspend fun. Control flow is completely up to you.
    index = indexBuilder.update()
}

Advanced usage

Want more filesystem roots? Sure. As many as you would reasonably want.

val indexBuilder = IndexBuilderCoroutines()
    .with(dirNameA)
    .with(dirNameB)
    .with(listOf(dirNameC, dirNameD, dirNameE))

Want only specific files? Apply a file filter. The default behaviour is to accept every file. Directories are always accepted.

val filter = object:java.io.FileFilter { /* whatever */ }
val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)
    .filter(filter)
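
As one possible filter, the sketch below accepts only files with selected extensions. It relies solely on the standard `java.io.FileFilter` interface; the `ExtensionFilter` class name and the choice of extensions are assumptions made for this example.

```kotlin
import java.io.File
import java.io.FileFilter

// Example filter (hypothetical name): accepts files whose extension is in
// the given set. Directories are accepted too, as a defensive default.
class ExtensionFilter(private val extensions: Set<String>) : FileFilter {
    override fun accept(file: File): Boolean =
        file.isDirectory || file.extension.lowercase() in extensions
}

fun main() {
    val filter = ExtensionFilter(setOf("kt", "java"))
    println(filter.accept(File("src/Main.kt"))) // true
    println(filter.accept(File("logo.png")))    // false
}
```

An instance of such a class can be passed wherever the `filter` value above is used.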

Want only large files? Or only small files? Or want to run a complex heuristic on file contents? There's an extension point for that. See the javadoc for details. The default behaviour is to accept everything; there's also a sample whitelist-based implementation that discards a file as soon as it encounters too many non-whitelisted characters.

val inspector = object: org.maurezen.indexer.ContentInspector { /* whatever */ }
val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)
    .inspectedBy(inspector)
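
The whitelist idea can be sketched independently of the library. The function below is hypothetical: it does not implement `ContentInspector` (whose methods are described in the javadoc, not here) and is not the bundled sample. It only shows the early-exit heuristic of rejecting content once too many non-whitelisted characters have been seen.

```kotlin
// Hypothetical sketch: NOT the library's ContentInspector interface and not
// its sample implementation; it only illustrates the heuristic.
fun looksLikeText(
    content: Sequence<Char>,
    maxRejected: Int,
    // Assumed whitelist: letters, digits, whitespace, printable ASCII.
    isAllowed: (Char) -> Boolean = {
        it.isLetterOrDigit() || it.isWhitespace() || it in '!'..'~'
    }
): Boolean {
    var rejected = 0
    for (c in content) {
        if (!isAllowed(c)) {
            rejected++
            // Stop early: the rest of the content does not need to be read.
            if (rejected > maxRejected) return false
        }
    }
    return true
}

fun main() {
    println(looksLikeText("fun main() {}".asSequence(), maxRejected = 0))
    println(looksLikeText("\u0000\u0001\u0002".asSequence(), maxRejected = 2))
}
```

Taking a lazy `Sequence<Char>` means a binary file can be rejected after its first few characters instead of being read in full.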

Want to deal with non-standard file formats or encodings? Implement your own reader. See the javadoc for details. The default behaviour is to assume files are UTF-8 encoded.

val reader = object: org.maurezen.indexer.FileReader { /* whatever */ }
val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)
    .readBy(reader)
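
The core of a custom reader for a different encoding is usually a decode step with an explicit charset. The helper below is a hypothetical stand-alone function, not an implementation of the library's `FileReader` interface; it only shows that step using the Kotlin standard library.

```kotlin
import java.io.File
import java.nio.charset.Charset

// Hypothetical helper, not the library's FileReader interface:
// decodes a whole file using an explicit charset instead of UTF-8.
fun readWithCharset(file: File, charset: Charset): String = file.readText(charset)

fun main() {
    val file = File.createTempFile("sample", ".txt")
    try {
        // Write a Latin-1 encoded file, then read it back with the same charset.
        file.writeText("caf\u00e9", Charsets.ISO_8859_1)
        println(readWithCharset(file, Charsets.ISO_8859_1))
    } finally {
        file.delete()
    }
}
```

Reading the same file as UTF-8 would not round-trip, because the single byte that encodes the accented character in ISO-8859-1 is not a valid UTF-8 sequence.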

Want to share an index between threads? Share a builder instance and request an index from it.

//thread A
val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)
//thread B
val index = indexBuilder.get()

Have a longer lifecycle? Want an update? Keep the builder instance around and trigger a build again when needed. indexBuilder.get() keeps returning the previous index version until the new computation completes.

val indexBuilder = IndexBuilderCoroutines()
    .with(dirName) 

var index = indexBuilder.buildAsync().await()

//things happen here
//...
//and now it's time for a refresh

index = indexBuilder.buildAsync().await()
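
The "previous version until the new one is ready" behaviour can be pictured as an atomically swapped snapshot. The class below is a hypothetical sketch built on `java.util.concurrent.atomic.AtomicReference`; it is not how the library is implemented, only one common way to get the same observable effect.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// Hypothetical sketch, not the library's implementation: readers keep
// seeing the last published snapshot until a rebuild has finished.
class SnapshotHolder<T : Any>(initial: T) {
    private val current = AtomicReference(initial)

    // Safe to call from any thread; never waits for a rebuild.
    fun get(): T = current.get()

    // Builds the new value first, then publishes it in a single atomic step.
    fun rebuild(build: () -> T): T {
        val fresh = build()
        current.set(fresh)
        return fresh
    }
}

fun main() {
    val holder = SnapshotHolder(setOf("a.kt"))
    holder.rebuild {
        // While the rebuild is running, readers still see the old snapshot.
        println(holder.get())
        setOf("a.kt", "b.kt")
    }
    println(holder.get())
}
```

Because the new value is published only after `build` returns, a reader never observes a half-built index.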

Changed your mind and don't want that refresh anymore?

val indexBuilder = IndexBuilderCoroutines()
    .with(dirName) 

val indexDeferred = indexBuilder.buildAsync()

indexDeferred.cancel()

Performance

There is no robust benchmark setup yet. Here is anecdotal data for indexing all the files of an intellij-community-master snapshot from late 2020 on a mostly idle 5950X (under 10% background CPU usage):

Size: 563 MB (590,787,068 bytes)
Contains: 120,686 Files, 26,343 Folders
Created: Saturday, December 12, 2020
IndexBuilderCoroutines()
    .with(INTELLIJ_COMMUNITY_MASTER)
    .inspectedBy(WhitelistCharacterInspector(5))
    .filter(ACCEPTS_EVERYTHING)
    .buildAsync().await()

| -Xmx | Time  |
|------|-------|
| 12g  | 26.7s |
| 2g   | 41.3s |

Likewise, there is no robust memory footprint measurement, but a ready-to-query index of intellij-community-master takes roughly 230 MB. The indexing process itself needs anywhere from 12 GB down to as little as 1 GB, depending on the indexing pipeline settings, which trade throughput for memory footprint.
