
Commit

resolved conflict
Aaishik Dutta authored and committed on Jul 11, 2023
2 parents ec95d66 + 9ca8365 commit abf2559
Showing 32 changed files with 502 additions and 211 deletions.
24 changes: 24 additions & 0 deletions Makefile
@@ -0,0 +1,24 @@
# Variables
PYTHON := python3
PIP := $(PYTHON) -m pip
PROJECT_NAME := embedchain

# Targets
.PHONY: install format lint clean test

install:
$(PIP) install --upgrade pip
$(PIP) install .[dev]

format:
$(PYTHON) -m black .
$(PYTHON) -m isort .

lint:
$(PYTHON) -m ruff .

clean:
rm -rf dist build *.egg-info

test:
$(PYTHON) -m pytest
65 changes: 61 additions & 4 deletions README.md
@@ -44,6 +44,7 @@ embedchain is a framework to easily create LLM powered bots over any dataset. If
- [Reset](#reset)
- [Count](#count)
- [How does it work?](#how-does-it-work)
- [Contribution Guidelines](#contribution-guidelines)
- [Tech Stack](#tech-stack)
- [Team](#team)
- [Author](#author)
@@ -224,7 +225,7 @@ print(naval_chat_bot.chat("what did the author say about happiness?"))

### Stream Response

- You can add config to your query method to stream responses like ChatGPT does. You would require a downstream handler to render the chunk in your desirable format. Currently only supports OpenAI model.
- You can add config to your query method to stream responses like ChatGPT does. You will need a downstream handler to render the chunks in your desired format. Supports both the OpenAI model and OpenSourceApp.

- To use this, instantiate a `QueryConfig` or `ChatConfig` object with `stream=True`. Then pass it to the `.chat()` or `.query()` method. The following example iterates through the chunks and prints them as they appear.
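Conceptually, a streamed response is just an iterator over text chunks that a downstream handler consumes as they arrive. A minimal stdlib-only sketch of that consumption pattern (the `fake_stream` generator is a stand-in for illustration, not the embedchain API):

```python
def fake_stream(text, size=8):
    """Yield `text` in chunks of `size` characters, simulating a streamed response."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

# Downstream handler: consume and render each chunk as it arrives.
collected = []
for chunk in fake_stream("The meaning of life is subjective."):
    collected.append(chunk)  # a real handler would print or display the chunk here

answer = "".join(collected)
```

With `stream=True`, `.query()` / `.chat()` return an iterable like this instead of a single string, so the caller decides how each chunk is rendered.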

@@ -384,8 +385,17 @@ config = InitConfig(ef=embedding_functions.OpenAIEmbeddingFunction(
))
naval_chat_bot = App(config)

add_config = AddConfig() # Currently no options
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44", add_config)
# Example: define your own chunker config for `youtube_video`
youtube_add_config = {
"chunker": {
"chunk_size": 1000,
"chunk_overlap": 100,
"length_function": len,
}
}
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44", AddConfig(**youtube_add_config))

add_config = AddConfig()
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf", add_config)
naval_chat_bot.add("web_page", "https://nav.al/feedback", add_config)
naval_chat_bot.add("web_page", "https://nav.al/agi", add_config)
@@ -457,13 +467,39 @@ This section describes all possible config options.

#### **Add Config**

|option|description|type|default|
|---|---|---|---|
|chunker|chunker config|ChunkerConfig|Default values for the chunker depend on the `data_type`. Please refer to [ChunkerConfig](#chunker-config)|
|loader|loader config|LoaderConfig|None|

##### **Chunker Config**

|option|description|type|default|
|---|---|---|---|
|chunk_size|Maximum size of chunks to return|int|Default value for various `data_type` mentioned below|
|chunk_overlap|Overlap in characters between chunks|int|Default value for various `data_type` mentioned below|
|length_function|Function that measures the length of given chunks|typing.Callable|Default value for various `data_type` mentioned below|

Default values of chunker config parameters for different `data_type`:

|data_type|chunk_size|chunk_overlap|length_function|
|---|---|---|---|
|docx|1000|0|len|
|text|300|0|len|
|qna_pair|300|0|len|
|web_page|500|0|len|
|pdf_file|1000|0|len|
|youtube_video|2000|0|len|
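The table above is just a per-`data_type` mapping of splitter parameters. A plain-Python sketch of how such a lookup with user overrides could work (`chunker_defaults` is a hypothetical helper, not part of embedchain):

```python
# Default chunker parameters per data_type, mirroring the table above.
CHUNKER_DEFAULTS = {
    "docx":          {"chunk_size": 1000, "chunk_overlap": 0, "length_function": len},
    "text":          {"chunk_size": 300,  "chunk_overlap": 0, "length_function": len},
    "qna_pair":      {"chunk_size": 300,  "chunk_overlap": 0, "length_function": len},
    "web_page":      {"chunk_size": 500,  "chunk_overlap": 0, "length_function": len},
    "pdf_file":      {"chunk_size": 1000, "chunk_overlap": 0, "length_function": len},
    "youtube_video": {"chunk_size": 2000, "chunk_overlap": 0, "length_function": len},
}

def chunker_defaults(data_type, overrides=None):
    """Return the default chunker params for `data_type`, applying any user overrides."""
    params = dict(CHUNKER_DEFAULTS[data_type])
    params.update(overrides or {})
    return params
```

For example, `chunker_defaults("youtube_video", {"chunk_overlap": 100})` keeps the 2000-character chunk size but overrides the overlap, which is what passing a custom `ChunkerConfig` to `add` achieves.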

##### **Loader Config**

_coming soon_

#### **Query Config**

|option|description|type|default|
|---|---|---|---|
|template|custom template for prompt|Template|Template("Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. \$context Query: $query Helpful Answer:")|
|template|custom template for prompt|Template|Template("Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. \$context Query: \$query Helpful Answer:")|
|history|include conversation history from your client or database|any (recommendation: list[str])|None|
|stream|control if response is streamed back to the user|bool|False|
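The default template in the table above is a standard `string.Template`, so `$context` and `$query` are substituted at query time (the `\$` in the README escapes `$` for markdown). A stdlib sketch of that substitution:

```python
from string import Template

# The default prompt template from the table above.
template = Template(
    "Use the following pieces of context to answer the query at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer. $context Query: $query Helpful Answer:"
)

prompt = template.substitute(
    context="Naval discussed long-term thinking.",
    query="What did Naval say?",
)
```

A custom `template` passed via `QueryConfig` would be substituted the same way, so it should contain `$context` and `$query` placeholders.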

@@ -523,6 +559,27 @@ embedchain is a framework which takes care of all these nuances and provides a s

In the first release, we are making it easier for anyone to get a chatbot over any dataset up and running in less than a minute. All you need to do is create an app instance, add the datasets using the `.add` function, and then use the `.query` function to get the relevant answer.

# Contribution Guidelines

Thank you for your interest in contributing to the EmbedChain project! We welcome your ideas and contributions to help improve the project. Please follow the instructions below to get started:

1. **Fork the repository**: Click on the "Fork" button at the top right corner of this repository page. This will create a copy of the repository in your own GitHub account.

2. **Install the required dependencies**: Ensure that you have the necessary dependencies installed in your Python environment. You can do this by running the following command:

```bash
make install
```

3. **Make changes in the code**: Create a new branch in your forked repository and make your desired changes in the codebase.
4. **Format code**: Before creating a pull request, it's important to ensure that your code follows our formatting guidelines. Run the following commands to format the code:

```bash
make lint format
```

5. **Create a pull request**: When you are ready to contribute your changes, submit a pull request to the EmbedChain repository. Provide a clear and descriptive title for your pull request, along with a detailed description of the changes you have made.

# Tech Stack

embedchain is built on the following stack:
2 changes: 1 addition & 1 deletion embedchain/__init__.py
@@ -1 +1 @@
from .embedchain import App, OpenSourceApp, PersonApp, PersonOpenSourceApp
from .embedchain import App, OpenSourceApp, PersonApp, PersonOpenSourceApp
9 changes: 6 additions & 3 deletions embedchain/chunkers/base_chunker.py
@@ -3,14 +3,17 @@

class BaseChunker:
def __init__(self, text_splitter):
"""Initialize the chunker."""
self.text_splitter = text_splitter

def create_chunks(self, loader, src):
"""
Loads data and chunks it.
:param loader: The loader which's `load_data` method is used to create the raw data.
:param src: The data to be handled by the loader. Can be a URL for remote sources or local content for local loaders.
:param loader: The loader whose `load_data` method is used to create
the raw data.
:param src: The data to be handled by the loader. Can be a URL for
remote sources or local content for local loaders.
"""
documents = []
ids = []
@@ -26,7 +29,7 @@ def create_chunks(self, loader, src):

for chunk in chunks:
chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
if (idMap.get(chunk_id) is None):
if idMap.get(chunk_id) is None:
idMap[chunk_id] = True
ids.append(chunk_id)
documents.append(chunk)
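The dedup step in `create_chunks` above keys each chunk by a SHA-256 of the chunk text plus its source URL, so identical chunks from the same source are stored once. A self-contained sketch of that logic:

```python
import hashlib

def dedup_chunks(chunks, url):
    """Return (ids, documents) with duplicate chunks dropped,
    keyed by sha256(chunk + url) as in BaseChunker.create_chunks."""
    id_map = {}
    ids, documents = [], []
    for chunk in chunks:
        chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
        if id_map.get(chunk_id) is None:
            id_map[chunk_id] = True
            ids.append(chunk_id)
            documents.append(chunk)
    return ids, documents
```

Including the URL in the hash means the same text from two different sources still gets two distinct ids.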
12 changes: 9 additions & 3 deletions embedchain/chunkers/docx_file.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 1000,
@@ -11,6 +13,10 @@


class DocxFileChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
"""Chunker for .docx file."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
14 changes: 10 additions & 4 deletions embedchain/chunkers/pdf_file.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 1000,
@@ -11,6 +13,10 @@


class PdfFileChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
super().__init__(text_splitter)
"""Chunker for PDF file."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
12 changes: 9 additions & 3 deletions embedchain/chunkers/qna_pair.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 300,
@@ -11,6 +13,10 @@


class QnaPairChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
"""Chunker for QnA pair."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
12 changes: 9 additions & 3 deletions embedchain/chunkers/text.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 300,
@@ -11,6 +13,10 @@


class TextChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
"""Chunker for text."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
12 changes: 9 additions & 3 deletions embedchain/chunkers/web_page.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 500,
@@ -11,6 +13,10 @@


class WebPageChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
"""Chunker for web page."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
14 changes: 10 additions & 4 deletions embedchain/chunkers/youtube_video.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 2000,
@@ -11,6 +13,10 @@


class YoutubeVideoChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
super().__init__(text_splitter)
"""Chunker for Youtube video."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
38 changes: 36 additions & 2 deletions embedchain/config/AddConfig.py
@@ -1,8 +1,42 @@
from typing import Callable, Optional

from embedchain.config.BaseConfig import BaseConfig


class ChunkerConfig(BaseConfig):
"""
Config for the chunker used in `add` method
"""

def __init__(
self,
chunk_size: Optional[int] = 4000,
chunk_overlap: Optional[int] = 200,
length_function: Optional[Callable[[str], int]] = len,
):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.length_function = length_function


class LoaderConfig(BaseConfig):
"""
Config for the loader used in `add` method
"""

def __init__(self):
pass


class AddConfig(BaseConfig):
"""
Config for the `add` method.
"""
def __init__(self):
pass

def __init__(
self,
chunker: Optional[ChunkerConfig] = None,
loader: Optional[LoaderConfig] = None,
):
self.loader = loader
self.chunker = chunker
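With the classes added above, `AddConfig` now nests an optional `ChunkerConfig` and `LoaderConfig`. A standalone sketch of how they compose, using plain-Python stand-ins mirroring the diff (not importing embedchain):

```python
from typing import Callable, Optional

# Stand-in mirroring ChunkerConfig from the diff above.
class ChunkerConfig:
    def __init__(self,
                 chunk_size: Optional[int] = 4000,
                 chunk_overlap: Optional[int] = 200,
                 length_function: Optional[Callable[[str], int]] = len):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.length_function = length_function

# Stand-in mirroring AddConfig from the diff above.
class AddConfig:
    def __init__(self,
                 chunker: Optional[ChunkerConfig] = None,
                 loader=None):
        self.chunker = chunker
        self.loader = loader

# A caller can now tune chunking per `add` call:
config = AddConfig(chunker=ChunkerConfig(chunk_size=1000, chunk_overlap=100))
```

When `chunker` is left as `None`, each chunker falls back to its per-`data_type` `TEXT_SPLITTER_CHUNK_PARAMS` defaults, as shown in the chunker diffs.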
1 change: 1 addition & 0 deletions embedchain/config/BaseConfig.py
@@ -2,6 +2,7 @@ class BaseConfig:
"""
Base config.
"""

def __init__(self):
pass

