A natural language search engine for your personal notes, transactions and images
Table of Contents
- Features
- Demos
- Architecture
- Setup
- Use
- Upgrade
- Uninstall
- Troubleshoot
- Advanced Usage
- Miscellaneous
- Performance
- Development
- Credits
- Natural: Advanced natural language understanding using Transformer based ML Models
- Local: Your personal data stays local. All search and indexing is done on your machine*
- Incremental: Incremental search for a fast, search-as-you-type experience
- Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
- Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and Photos
- Multiple Interfaces: Search from your Web Browser, Emacs or Obsidian
khoj_obsidian_demo_0.1.0_720p.mp4
Description
- Install Khoj via `pip` and start the Khoj backend in non-GUI mode
- Install the Khoj plugin via the Community Plugins settings pane on the Obsidian app
- Check the new Khoj plugin settings
- Let Khoj backend index the markdown files in the current Vault
- Open Khoj plugin on Obsidian via Search button on Left Pane
- Search "Announce plugin to folks" in the Obsidian Plugin docs
- Jump to the search result
Khoj_Incremental_Search_Demo_0.1.5.mp4
Description
- Install Khoj via pip
- Start Khoj app
- Add this readme and the khoj.el readme as org-mode files for Khoj to index
- Search "Setup editor" on the Web and Emacs. Re-rank the results for better accuracy
- The top result is what we were looking for: the section on installing khoj.el on Emacs
Analysis
- The results do not contain any of the words used in the query
- Based on the top result, the re-ranking model seems to understand that Emacs is an editor
- The results incrementally update as the query is entered
- The results are re-ranked, for better accuracy, once the user hits enter
These are the general setup instructions for Khoj.
- Make sure python and pip are installed on your machine
- Check the Khoj.el Readme to setup Khoj with Emacs
- Check the Khoj Obsidian Readme to setup Khoj with Obsidian
It's simpler, as it skips the configure step below.
```shell
pip install khoj-assistant
khoj
```
Note: To start Khoj automatically in the background, use Task Scheduler on Windows or cron on Mac and Linux (e.g. with `@reboot khoj`)
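As a concrete sketch of the cron approach, a crontab entry like the one below starts the khoj server on every boot. The `--no-gui` flag (mentioned later in this readme) keeps the server headless; adjust the path to the `khoj` executable for your environment:

```
# Start the khoj server in the background on every boot (sketch; khoj must be on cron's PATH)
@reboot khoj --no-gui
```

Add the entry by running `crontab -e` as your user.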
- Enable content types and point to files to search in the First Run Screen that pops up on app start
- Click `Configure` and wait. The app will download ML models and index the content for search
Khoj exposes a web interface by default.
The optional steps below allow using Khoj from within an existing application like Obsidian or Emacs.
- Khoj via Obsidian
- Click the Khoj search icon 🔎 on the Ribbon or Search for Khoj: Search in the Command Palette
- Khoj via Emacs
- Run `M-x khoj <user-query>`
- Khoj via Web
- Open http://localhost:8000/ via the desktop interface or directly
- Khoj via API
- See the Khoj FastAPI Swagger Docs, ReDocs
Query Filters
Use structured query syntax to filter the natural language search results
- Word Filter: Get entries that include/exclude a specified term
  - Entries that contain term_to_include: `+"term_to_include"`
  - Entries that exclude term_to_exclude: `-"term_to_exclude"`
- Date Filter: Get entries containing dates in YYYY-MM-DD format from the specified date (range)
  - Entries from April 1st 1984: `dt:"1984-04-01"`
  - Entries after March 31st 1984: `dt>="1984-04-01"`
  - Entries before April 2nd 1984: `dt<="1984-04-01"`
- File Filter: Get entries from a specified file
  - Entries from incoming.org file: `file:"incoming.org"`
- Combined Example
  - `what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"`
- Adds all filters to the natural language query. It should return entries
- from the file 1984.org
- containing dates from the year 1984
- excluding words "big" and "brother"
- that best match the natural language query "what is the meaning of life?"
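Since these filters travel inside the query string, a combined query like the one above needs URL-encoding when sent to the search API directly. A minimal Python sketch is below; the `/api/search` endpoint and its `q`/`t` parameters are based on the FastAPI docs mentioned earlier, so treat the exact names as assumptions to verify against your server's Swagger page:

```python
from urllib.parse import urlencode

# Combined query mixing natural language with file, date and word filters
query = 'what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"'

# Build the request URL for a local khoj server; "t" selects the content type to search
params = urlencode({"q": query, "t": "org"})
url = f"http://localhost:8000/api/search?{params}"
print(url)
```

Quotes, colons and comparison operators in the filters are percent-encoded by `urlencode`, so they survive the trip through the query string intact.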
- Creates a personal assistant for you to inquire and engage with your notes
- Uses ChatGPT and Khoj search
- Supports multi-turn conversations with the relevant notes for context
- Shows reference notes used to generate a response
- Note: Your query and top notes from khoj search will be sent to OpenAI for processing
- Your query is used to retrieve the most relevant notes, if any, using Khoj search
- These notes, the last few messages and associated metadata are passed to ChatGPT along with your query for a response
```shell
pip install --upgrade khoj-assistant
```
Note: To upgrade to the latest pre-release version of the khoj server, run the command below:

```shell
# Maps to the latest commit on the master branch
pip install --upgrade --pre khoj-assistant
```
- Use your Emacs Package Manager to Upgrade
- See khoj.el readme for details
- Upgrade via the Community plugins tab on the settings pane in the Obsidian app
- See the khoj plugin readme for details
- (Optional) Hit `Ctrl-C` in the terminal running the khoj server to stop it
- Delete the khoj directory in your home folder (i.e. `~/.khoj` on Linux, Mac or `C:\Users\<your-username>\.khoj` on Windows)
- Uninstall the khoj server with `pip uninstall khoj-assistant`
- (Optional) Uninstall khoj.el or the khoj obsidian plugin in the standard way on Emacs, Obsidian
- Details: `pip install khoj-assistant` fails while building the `tokenizers` dependency. Complains about Rust.
- Fix: Install Rust to build the tokenizers package. For example, on Mac run:
  ```shell
  brew install rustup
  rustup-init
  source ~/.cargo/env
  ```
- Refer: Issue with Fix for more details
- Fix: Open /api/update?force=true in your browser to regenerate the index from scratch
- Note: This is a fix for when you perceive the search results have degraded, not for when they have always given wonky results
- Fix: Increase RAM available to Docker Containers in Docker Settings
- Refer: StackOverflow Solution, Configure Resources on Docker for Mac
- Mitigation: Disable `image` search using the desktop GUI
- Setup Khoj on your personal server. This can be any always-on machine, e.g. an old computer, a Raspberry Pi, etc.
- Install Tailscale on your personal server and phone
- Open the Khoj web interface of the server from your phone browser. It should be `http://tailscale-ip-of-server:8000` or `http://name-of-server:8000` if you've set up MagicDNS
- Click the Add to Homescreen button
- Enjoy exploring your notes, transactions and images from your phone!
- Set `encoder-type`, `encoder` and `model-directory` under the `asymmetric` and/or `symmetric` `search-type` in your `khoj.yml`:

```diff
asymmetric:
-  encoder: "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
+  encoder: text-embedding-ada-002
+  encoder-type: src.khoj.utils.models.OpenAI
   cross-encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2"
-  encoder-type: sentence_transformers.SentenceTransformer
-  model_directory: "~/.khoj/search/asymmetric/"
+  model-directory: null
```
- Setup your OpenAI API key in Khoj
- Restart Khoj server to generate embeddings. It will take longer than with offline models.
Warning: This configuration uses an online model
- It will send all notes to OpenAI to generate embeddings
- All queries will be sent to OpenAI when you search with Khoj
- You will be charged by OpenAI based on the total tokens processed
- It requires an active internet connection to search and index
To search for notes in multiple, different languages, you can use a multi-lingual model.
For example, the paraphrase-multilingual-MiniLM-L12-v2 supports 50+ languages, has good search quality and speed. To use it:
- Manually update `search-type > asymmetric > encoder` to `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` in your `~/.khoj/khoj.yml` file for now. See the diff of `khoj.yml` below for illustration:
```diff
 asymmetric:
-  encoder: "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
+  encoder: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
   cross-encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2"
   model_directory: "~/.khoj/search/asymmetric/"
```
- Regenerate your content index. For example, by opening <khoj-url>/api/update?t=force
If you want, Khoj can be configured to use OpenAI for search and chat.
Add your OpenAI API to Khoj by using either of the two options below:
- Open the Khoj desktop GUI, add your OpenAI API key and click Configure. Ensure khoj is started without the `--no-gui` flag. Check your system tray to see if Khoj 🦅 is minimized there.
- Set the `openai-api-key` field under the `processor.conversation` section in your `khoj.yml` to your OpenAI API key and restart khoj:

```diff
processor:
  conversation:
-    openai-api-key: # "YOUR_OPENAI_API_KEY"
+    openai-api-key: sk-aaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhh
    model: "text-davinci-003"
    conversation-logfile: "~/.khoj/processor/conversation/conversation_logs.json"
```
Warning: This will enable Khoj to send your query and note(s) to OpenAI for processing
- The chat, answer and search API endpoints use OpenAI API
- They are disabled by default
- To use them:
- Setup your OpenAI API key in Khoj
- Interact with them from the Khoj Swagger docs
- Semantic search using the bi-encoder is fairly fast at <50 ms
- Reranking using the cross-encoder is slower at <2s on 15 results. Tweak `top_k` to trade off speed for accuracy of results
- Filters in query (e.g. by file, word or date) usually add <20ms to query latency
- Indexing is more strongly impacted by the size of the source data
- Indexing 100K+ line corpus of notes takes about 10 minutes
- Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
- Note: It should only take this long on the first run as the index is incrementally updated
- Testing done on a Mac M1 and a >100K line corpus of notes
- Search, indexing on a GPU has not been tested yet
```shell
# Get Khoj Code
git clone https://github.com/debanjum/khoj && cd khoj

# Create, Activate Virtual Environment
python3 -m venv .venv && source .venv/bin/activate

# Install Khoj for Development
pip install -e .[dev]
```
- Start Khoj with `khoj -vv`
- Configure Khoj
  - Via GUI: Add files, directories to index in the GUI window that pops up on starting Khoj, then click Configure
  - Manually:
    - Copy `config/khoj_sample.yml` to `~/.khoj/khoj.yml`
    - Set `input-files` or `input-filter` in each relevant `content-type` section of `~/.khoj/khoj.yml`
      - Set the `input-directories` field in the `image` `content-type` section
    - Delete `content-type` and `processor` sub-section(s) irrelevant for your use-case
    - Restart khoj
Note: Wait after configuration for khoj to load the ML model, generate embeddings and expose the API to query the notes, images, transactions etc. specified in the config YAML
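For illustration, the manual steps above might produce a `~/.khoj/khoj.yml` fragment like the one below. This is only a sketch: the field names follow the steps above, the paths are hypothetical placeholders, and `config/khoj_sample.yml` remains the authoritative template:

```yaml
# Sketch of a trimmed khoj.yml with just an org-mode content-type section
content-type:
  org:
    # Hypothetical note locations; replace with your own
    input-files: ["~/notes/journal.org"]
    input-filter: ["~/notes/**/*.org"]
```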
```shell
git clone https://github.com/debanjum/khoj && cd khoj
```
- Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
- Optional: Edit application configuration in khoj_docker.yml
```shell
docker-compose up -d
```
Note: The first run will take time. Let it run; it's most likely not hung, just generating embeddings
```shell
docker-compose build --pull
```
```shell
git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
python3 -m pip install pyqt6  # As conda does not support pyqt6 yet
```
- Copy `config/khoj_sample.yml` to `~/.khoj/khoj.yml`
- Set `input-files` or `input-filter` in each relevant `content-type` section of `~/.khoj/khoj.yml`
  - Set the `input-directories` field in the `image` `content-type` section
- Delete `content-type`, `processor` sub-sections irrelevant for your use-case
```shell
python3 -m src.khoj.main -vv
```
Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML
```shell
cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj
```
- Install Git Hooks for Validation
```shell
pre-commit install -t pre-push -t pre-commit
```
- This ensures standard code formatting fixes and other checks run automatically on every commit and push
- Note 1: If pre-commit wasn't already installed, install it via `pip install pre-commit`
- Note 2: To run the pre-commit checks manually, use `pre-commit run --hook-stage manual --all` before creating a PR
- Run Tests: `pytest`
- Run MyPy to check types: `mypy --config-file pyproject.toml`
- Automated validation workflows run for every PR. Ensure any issues they flag are fixed
- Test the python package created for a PR
  - Download and extract the zipped `.whl` artifact generated from the pypi workflow run for the PR
  - Install it (in your virtualenv) with `pip install /path/to/download*.whl`
  - Start and use the application to see if it works fine
- Multi-QA MiniLM Model, All MiniLM Model for Text Search. See SBert Documentation
- OpenAI CLIP Model for Image Search. See SBert Documentation
- Charles Cave for OrgNode Parser
- Org.js to render Org-mode results on the Web interface
- Markdown-it to render Markdown results on the Web interface