fginter/simstring-cuda

A quick implementation of cosine-based fuzzy string lookup (a little like the well-known simstring library) using sklearn, torch, and GPU acceleration. With an index of a few million strings, batched queries, and a GPU it can hold its own; otherwise it loses to simstring in speed, but on the other hand it is easy to install. I leave this here in case anyone wants it.

A simple implementation of fuzzy string search against a small-to-midsize set of strings (a few million at most) using torch and GPU acceleration. The metric is cosine similarity of character 3-grams. This is meant to be a poor man's version of simstring: it does not scale up to anywhere near the database sizes simstring handles and does not implement any of simstring's finer tricks. On the other hand, it is easy to install; all it needs is sklearn and torch.

If the queries are batched by a few hundred, lookup against a DB of 1.4M strings from Wikidata runs at about 0.004 sec per string on a relatively dated GPU.
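For intuition, the character 3-gram cosine metric can be reproduced with sklearn alone. The snippet below is an illustrative sketch of the metric only, not the package's internal code; the vectorizer settings and example strings are assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative sketch only: count character 3-grams and compare a query to a tiny DB.
db = ["helsinki", "helsingfors", "stockholm"]
query = ["helsinky"]
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
db_vectors = vectorizer.fit_transform(db)
query_vectors = vectorizer.transform(query)
print(cosine_similarity(query_vectors, db_vectors))  # "helsinki" scores highest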

Installation

python3 setup.py install

or

python3 setup.py bdist_wheel

The wheel file ends up in dist/SimString_cuda-0.1.0-py3-none-any.whl, and you can then install it anywhere you want with pip3 install path/to/wheel.whl. This has the advantage of adding the command-line executable simscuda to your path.

Usage

Throughout this section, strings refers to the list of strings to index.

Make an index and save it:

import simstringcuda as ssc
ssc_idx=ssc.build_index(strings)
ssc.save_index(ssc_idx,filename)
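For example, to index every line of a plain-text file (the file names below are made up for illustration):

import simstringcuda as ssc

# Read one string per line; file names are made up for illustration.
with open("names.txt", encoding="utf-8") as f:
    strings = [line.strip() for line in f if line.strip()]

ssc_idx = ssc.build_index(strings)
ssc.save_index(ssc_idx, "names.index")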

Load a saved index:

ssc_idx=ssc.load_index(filename)
ssc_idx.cuda()  # Optional: place the index on the GPU so that all searches run there.
                # You can skip this if your DB only holds a small number of strings.
                # The method passes all of its arguments on to torch's .cuda() call.
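Since the arguments are forwarded to torch's .cuda(), picking a specific GPU should work the same way as for a tensor; the one-liner below is an assumption based on that, for a machine with more than one GPU.

ssc_idx.cuda(1)  # assumed to forward to torch's .cuda(1), i.e. place the index on the second GPU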

Lookup some strings:

For the GPU to make any sense, queries should preferably be batched into batches of a few hundred or so, depending on your GPU memory. The limiting factor on memory is that a matrix of size index × query is created. If your lookup runs out of GPU memory, use smaller query batches.

queries=["my","query","strings","there","can","be","many"]
res=ssc.lookup(queries,ssc_idx,10) #find top-10 hits for every query string
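With a long query list, a simple way to stay within memory is to chunk the queries yourself and call lookup once per batch. The batch size below is an assumption to tune against your GPU memory, and the structure of the returned results is not shown here.

batch_size = 256  # assumption: tune to your GPU memory
all_res = []
for i in range(0, len(queries), batch_size):
    batch = queries[i:i + batch_size]
    all_res.append(ssc.lookup(batch, ssc_idx, 10))  # top-10 hits for each query in the batch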

Command-line usage:

The simscuda command is installed for you when you install the package via pip, so that is probably the best way to install it.

pip3 install path/to/builtwheel.whl
simscuda -h

Create an index out of all the strings in a file and store it as the file index.fi:

bzcat strings.fi.bz2 | simscuda -c index.fi

Look up the first 1000 of the same strings against the index:

bzcat strings.fi.bz2 | head -n 1000 | simscuda index.fi

And get the output in JSONL format for easier processing later:

bzcat strings.fi.bz2 | head -n 1000 | simscuda --jsonl index.fi > out.jsonl
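The JSONL file can then be processed with a few lines of Python; the exact fields of each record are not documented here, so this sketch just parses and prints the records.

import json

with open("out.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # one lookup result per line
        print(record)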
