Create n-grams, index items, and search through them.
go get github.com/hmerritt/go-ngram
I highly recommend reading the following:
The N
in n-gram refers to a set number. The most common n-gram is the 3-gram, usually called a "tri-gram".
A tri-gram index stores each item in sets of 3.
Hello, World -> [Hel ell llo lo, o,_ _Wo Wor orl rld]
Search("rld") -> Hello, World
^^^ any 3 (or more) characters will find a match
- Used to search through large data sets quickly.
- Has built-in spelling correction (only 3 chars need to match)
Methods | Description |
---|---|
NewNgramIndex |
Returns a new trigram index with the default settings |
NewIndexValue |
Returns a new index value using custom input values (use this for Add() ) |
Add |
Add a string AND your own index value |
Search |
Returns slice of all matched strings which contains your index value and the match strength, []*IndexValue |
Helper functions | |
StringToNgram |
Returns an ngram of length N - Splits a string into groups of N length |
GetMatches |
Get all ngram matches, returns an unsorted map of indexes along with their Index Value struct |
SortMatches |
Sorts output from GetMatches into a slice, first index [0] = best match |
// Create a new index
// Default ngram size is '3' (tri-gram)
ni := ngram.NewNgramIndex()
// Add string + add your OWN index value
// (the index value is what will be returned when a search matches)
ni.Add(string, *IndexValue)
ni.Add("I just think they're neat.", NewIndexValue(0, "This additional index string can be anything"))
ni.Add("Don’t eat me. I have a wife and kids. Eat them", NewIndexValue(1, "It could be a filepath"))
ni.Add("Words to match above quotes... string eat neat", NewIndexValue(2, "Or maybe a URL"))
// Search
// Returns a sorted slice of matches (sorts based on number of matches)
search := ni.Search("Eat")
// -> [1, 2]
// -> [IndexValue{1, "It could be a filepath"}, IndexValue{2, "Or maybe a URL"}]
search = ni.Search("near") // near will match [nea]t
// -> [0, 2]
// Usually, you would loop your own data-set
// and add the index from that as the ngram index value
for index, file := myFiles {
ni.Add(file.content, NewIndexValue(index, file.path)
}
package main
import (
"fmt"
ngram "github.com/hmerritt/go-ngram"
)
// Example structure
type File struct {
Path string
Content string
}
// Example data-set
var files = make([]File, 0)
func main() {
// Add Files to example data-set
files = append(files, File{
Path: "/home/me/nice.txt",
Content: "Example file data, nice",
})
files = append(files, File{
Path: "/home/me/example-file.txt",
Content: "More example file data, neat",
})
/*
* Create N-gram here!
*/
// Create a new index
// Default ngram size is '3' (tri-gram)
ni := ngram.NewNgramIndex()
// Add each File in our example data-set
for index, file := range files {
// Add a string + the index value we want to store in assosiation with the input string
ni.Add(file.Content, ngram.NewIndexValue(index, file.Path))
// What this is actually doing
ni.Add("More example file data, neat", ngram.NewIndexValue(1, "/home/me/example-file.txt"))
}
// Search indexed data
// returns a sorted slice
search := ni.Search("neat")
fmt.Println("Search =", search)
// -> [1]
// -> [IndexValue{
// Index: 1,
// Data: "/home/me/example-file.txt"
// }]
}