Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
.github/workflows		.github/workflows
lib		lib
man		man
.gitignore		.gitignore
LICENSE.md		LICENSE.md
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum

Repository files navigation

`docsim`

A simple, fast command-line tool for scoring the similarity of text documents.

$ docsim --query some-file.txt --show-scores ~/documents/notes
0.000  completely-dissimilar-file.txt
0.152  somewhat-similar-file.md
0.469  pretty-similar-file.md
0.872  very-similar-file.org
1.000  potentially-identical-file.txt

Given a query document and a collection of potential matches, docsim ranks each document in the collection by its textual similarity to the query.

Examples

Check the man page for the definitive documentation, but these should get you started.

Search for similar files in a given directory:

$ docsim --query some-file.txt ~/documents/notes
[...]

If no paths are provided docsim will search the current working directory.

$ docsim --query some-file.txt
[...]

Without a provided --query document docsim takes input from STDIN:

$ echo "Here's a query to search for." | docsim ~/documents/notes
[...]

Only show the top 3 matches, with the best at the top:

$ docsim --query some-file.txt --limit 3 --best-first ~/documents/notes
potentially-identical-file.txt
very-similar-file.org
pretty-similar-file.md

Find Go files similar to main.go in the current directory. Don't use stemming or stoplists, since these aren't English documents.

$ docsim --query main.go --no-stemming --no-stoplist **/*.go
[...]

Notice that because docsim uses an English stoplist and an English stemming algorithm, you'll almost certainly want to use the --no-stoplist and --no-stemming flags if your documents are written in another language (including source code).

Optionally, you can use the --stoplist flag to provide a custom stoplist. A custom stoplist is just a text file of words to ignore, separated by whitespace.

Installation

The easiest thing is probably to grab a compiled binary appropriate to your platform.

But if you've got a Go toolchain handy, you can also either:

$ git clone [email protected]:hrs/docsim.git
$ cd docsim
$ sudo make install

Or just:

$ go install github.com/hrs/docsim/docsim@latest

Note that using go install doesn't include the man page, which you can optionally install manually by copying into e.g. /usr/local/share/man/man1.

Running tests

$ make test

How it works

docsim uses TF-IDF weighting and cosine similarity to numerically score the textual similarity between the query and every other document.

"Textual similarity" roughly means "uses the same words." Each document is parsed into a big bag of words, which are passed through a common English stoplist, stemmed (so "spins," "spinner," and "spinning" might all reduce down to just "spin"), and assigned weights based on how often they appear in the document and how rare they are in the corpus as a whole.

We can think of each of these documents as a vector in term space, where each word is a dimension with its weight as a magnitude. Two documents are "similar," then, inasmuch as they point in the same direction, so we define similarity by the size of the angle between them.

Contributing

docsim is still in a nascent state, so I'm happy just writing the code myself for now, but please feel free to report any issues you encounter!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`docsim`

Examples

Installation

Running tests

How it works

Contributing

About

Releases 8

Languages

License

hrs/docsim

Folders and files

Latest commit

History

Repository files navigation

docsim

Examples

Installation

Running tests

How it works

Contributing

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 8

Languages

`docsim`