
Commit

resolved conflict
Aaishik Dutta authored and committed on Jul 11, 2023
2 parents ec95d66 + 9ca8365 commit abf2559
Showing 32 changed files with 502 additions and 211 deletions.
24 changes: 24 additions & 0 deletions Makefile
@@ -0,0 +1,24 @@
# Variables
PYTHON := python3
PIP := $(PYTHON) -m pip
PROJECT_NAME := embedchain

# Targets
.PHONY: install format lint clean test

install:
$(PIP) install --upgrade pip
$(PIP) install .[dev]

format:
$(PYTHON) -m black .
$(PYTHON) -m isort .

lint:
$(PYTHON) -m ruff .

clean:
rm -rf dist build *.egg-info

test:
$(PYTHON) -m pytest
65 changes: 61 additions & 4 deletions README.md
@@ -44,6 +44,7 @@ embedchain is a framework to easily create LLM powered bots over any dataset. If
- [Reset](#reset)
- [Count](#count)
- [How does it work?](#how-does-it-work)
- [Contribution Guidelines](#contribution-guidelines)
- [Tech Stack](#tech-stack)
- [Team](#team)
- [Author](#author)
@@ -224,7 +225,7 @@ print(naval_chat_bot.chat("what did the author say about happiness?"))

### Stream Response

- You can add config to your query method to stream responses like ChatGPT does. You would require a downstream handler to render the chunk in your desirable format. Currently only supports OpenAI model.
- You can add config to your query method to stream responses like ChatGPT does. You will need a downstream handler to render the chunks in your desired format. Supports both the OpenAI model and OpenSourceApp.

- To use this, instantiate a `QueryConfig` or `ChatConfig` object with `stream=True`. Then pass it to the `.chat()` or `.query()` method. The following example iterates through the chunks and prints them as they appear.
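Conceptually, a streamed response is just an iterator over text chunks that a downstream handler consumes as they arrive. A minimal stdlib-only sketch of that consumption pattern (the `fake_stream` generator is a stand-in for illustration, not the embedchain API):

```python
def fake_stream(text, size=8):
    """Yield `text` in chunks of `size` characters, simulating a streamed response."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

# Downstream handler: consume and render each chunk as it arrives.
collected = []
for chunk in fake_stream("The meaning of life is subjective."):
    collected.append(chunk)  # a real handler would print or display the chunk here

answer = "".join(collected)
```

With `stream=True`, `.query()` / `.chat()` return an iterable like this instead of a single string, so the caller decides how each chunk is rendered.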

@@ -384,8 +385,17 @@ config = InitConfig(ef=embedding_functions.OpenAIEmbeddingFunction(
))
naval_chat_bot = App(config)

add_config = AddConfig() # Currently no options
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44", add_config)
# Example: define your own chunker config for `youtube_video`
youtube_add_config = {
"chunker": {
"chunk_size": 1000,
"chunk_overlap": 100,
"length_function": len,
}
}
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44", AddConfig(**youtube_add_config))

add_config = AddConfig()
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf", add_config)
naval_chat_bot.add("web_page", "https://nav.al/feedback", add_config)
naval_chat_bot.add("web_page", "https://nav.al/agi", add_config)
@@ -457,13 +467,39 @@ This section describes all possible config options.

#### **Add Config**

|option|description|type|default|
|---|---|---|---|
|chunker|chunker config|ChunkerConfig|Default values for the chunker depend on the `data_type`. Please refer to [ChunkerConfig](#chunker-config)|
|loader|loader config|LoaderConfig|None|

##### **Chunker Config**

|option|description|type|default|
|---|---|---|---|
|chunk_size|Maximum size of chunks to return|int|Default value for various `data_type` mentioned below|
|chunk_overlap|Overlap in characters between chunks|int|Default value for various `data_type` mentioned below|
|length_function|Function that measures the length of given chunks|typing.Callable|Default value for various `data_type` mentioned below|

Default values of chunker config parameters for different `data_type`:

|data_type|chunk_size|chunk_overlap|length_function|
|---|---|---|---|
|docx|1000|0|len|
|text|300|0|len|
|qna_pair|300|0|len|
|web_page|500|0|len|
|pdf_file|1000|0|len|
|youtube_video|2000|0|len|
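The table above is just a per-`data_type` mapping of splitter parameters. A plain-Python sketch of how such a lookup with user overrides could work (`chunker_defaults` is a hypothetical helper, not part of embedchain):

```python
# Default chunker parameters per data_type, mirroring the table above.
CHUNKER_DEFAULTS = {
    "docx":          {"chunk_size": 1000, "chunk_overlap": 0, "length_function": len},
    "text":          {"chunk_size": 300,  "chunk_overlap": 0, "length_function": len},
    "qna_pair":      {"chunk_size": 300,  "chunk_overlap": 0, "length_function": len},
    "web_page":      {"chunk_size": 500,  "chunk_overlap": 0, "length_function": len},
    "pdf_file":      {"chunk_size": 1000, "chunk_overlap": 0, "length_function": len},
    "youtube_video": {"chunk_size": 2000, "chunk_overlap": 0, "length_function": len},
}

def chunker_defaults(data_type, overrides=None):
    """Return the default chunker params for `data_type`, applying any user overrides."""
    params = dict(CHUNKER_DEFAULTS[data_type])
    params.update(overrides or {})
    return params
```

For example, `chunker_defaults("youtube_video", {"chunk_overlap": 100})` keeps the 2000-character chunk size but overrides the overlap, which is what passing a custom `ChunkerConfig` to `add` achieves.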

##### **Loader Config**

_coming soon_

#### **Query Config**

|option|description|type|default|
|---|---|---|---|
|template|custom template for prompt|Template|Template("Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. \$context Query: $query Helpful Answer:")|
|template|custom template for prompt|Template|Template("Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. \$context Query: \$query Helpful Answer:")|
|history|include conversation history from your client or database|any (recommendation: list[str])|None|
|stream|control if response is streamed back to the user|bool|False|
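The default template in the table above is a standard `string.Template`, so `$context` and `$query` are substituted at query time (the `\$` in the README escapes `$` for markdown). A stdlib sketch of that substitution:

```python
from string import Template

# The default prompt template from the table above.
template = Template(
    "Use the following pieces of context to answer the query at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer. $context Query: $query Helpful Answer:"
)

prompt = template.substitute(
    context="Naval discussed long-term thinking.",
    query="What did Naval say?",
)
```

A custom `template` passed via `QueryConfig` would be substituted the same way, so it should contain `$context` and `$query` placeholders.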

@@ -523,6 +559,27 @@ embedchain is a framework which takes care of all these nuances and provides a s

In the first release, we are making it easier for anyone to get a chatbot over any dataset up and running in less than a minute. All you need to do is create an app instance, add the datasets using the `.add` function, and then use the `.query` function to get the relevant answer.

# Contribution Guidelines

Thank you for your interest in contributing to the EmbedChain project! We welcome your ideas and contributions to help improve the project. Please follow the instructions below to get started:

1. **Fork the repository**: Click on the "Fork" button at the top right corner of this repository page. This will create a copy of the repository in your own GitHub account.

2. **Install the required dependencies**: Ensure that you have the necessary dependencies installed in your Python environment. You can do this by running the following command:

```bash
make install
```

3. **Make changes in the code**: Create a new branch in your forked repository and make your desired changes in the codebase.
4. **Format code**: Before creating a pull request, it's important to ensure that your code follows our formatting guidelines. Run the following commands to format the code:

```bash
make lint format
```

5. **Create a pull request**: When you are ready to contribute your changes, submit a pull request to the EmbedChain repository. Provide a clear and descriptive title for your pull request, along with a detailed description of the changes you have made.

# Tech Stack

embedchain is built on the following stack:
2 changes: 1 addition & 1 deletion embedchain/__init__.py
@@ -1 +1 @@
from .embedchain import App, OpenSourceApp, PersonApp, PersonOpenSourceApp
from .embedchain import App, OpenSourceApp, PersonApp, PersonOpenSourceApp
9 changes: 6 additions & 3 deletions embedchain/chunkers/base_chunker.py
@@ -3,14 +3,17 @@

class BaseChunker:
def __init__(self, text_splitter):
"""Initialize the chunker."""
self.text_splitter = text_splitter

def create_chunks(self, loader, src):
"""
Loads data and chunks it.
:param loader: The loader which's `load_data` method is used to create the raw data.
:param src: The data to be handled by the loader. Can be a URL for remote sources or local content for local loaders.
:param loader: The loader whose `load_data` method is used to create
the raw data.
:param src: The data to be handled by the loader. Can be a URL for
remote sources or local content for local loaders.
"""
documents = []
ids = []
@@ -26,7 +29,7 @@ def create_chunks(self, loader, src):

for chunk in chunks:
chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
if (idMap.get(chunk_id) is None):
if idMap.get(chunk_id) is None:
idMap[chunk_id] = True
ids.append(chunk_id)
documents.append(chunk)
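The dedup step in `create_chunks` above keys each chunk by a SHA-256 of the chunk text plus its source URL, so identical chunks from the same source are stored once. A self-contained sketch of that logic:

```python
import hashlib

def dedup_chunks(chunks, url):
    """Return (ids, documents) with duplicate chunks dropped,
    keyed by sha256(chunk + url) as in BaseChunker.create_chunks."""
    id_map = {}
    ids, documents = [], []
    for chunk in chunks:
        chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
        if id_map.get(chunk_id) is None:
            id_map[chunk_id] = True
            ids.append(chunk_id)
            documents.append(chunk)
    return ids, documents
```

Including the URL in the hash means the same text from two different sources still gets two distinct ids.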
12 changes: 9 additions & 3 deletions embedchain/chunkers/docx_file.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 1000,
@@ -11,6 +13,10 @@


class DocxFileChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
"""Chunker for .docx file."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
14 changes: 10 additions & 4 deletions embedchain/chunkers/pdf_file.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 1000,
@@ -11,6 +13,10 @@


class PdfFileChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
super().__init__(text_splitter)
"""Chunker for PDF file."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
12 changes: 9 additions & 3 deletions embedchain/chunkers/qna_pair.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 300,
@@ -11,6 +13,10 @@


class QnaPairChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
"""Chunker for QnA pair."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
12 changes: 9 additions & 3 deletions embedchain/chunkers/text.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 300,
@@ -11,6 +13,10 @@


class TextChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
"""Chunker for text."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
12 changes: 9 additions & 3 deletions embedchain/chunkers/web_page.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 500,
@@ -11,6 +13,10 @@


class WebPageChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
"""Chunker for web page."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
14 changes: 10 additions & 4 deletions embedchain/chunkers/youtube_video.py
@@ -1,7 +1,9 @@
from embedchain.chunkers.base_chunker import BaseChunker
from typing import Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter

from embedchain.chunkers.base_chunker import BaseChunker
from embedchain.config.AddConfig import ChunkerConfig

TEXT_SPLITTER_CHUNK_PARAMS = {
"chunk_size": 2000,
@@ -11,6 +13,10 @@


class YoutubeVideoChunker(BaseChunker):
def __init__(self):
text_splitter = RecursiveCharacterTextSplitter(**TEXT_SPLITTER_CHUNK_PARAMS)
super().__init__(text_splitter)
"""Chunker for Youtube video."""

def __init__(self, config: Optional[ChunkerConfig] = None):
if config is None:
config = TEXT_SPLITTER_CHUNK_PARAMS
text_splitter = RecursiveCharacterTextSplitter(**config)
super().__init__(text_splitter)
38 changes: 36 additions & 2 deletions embedchain/config/AddConfig.py
@@ -1,8 +1,42 @@
from typing import Callable, Optional

from embedchain.config.BaseConfig import BaseConfig


class ChunkerConfig(BaseConfig):
"""
Config for the chunker used in `add` method
"""

def __init__(
self,
chunk_size: Optional[int] = 4000,
chunk_overlap: Optional[int] = 200,
length_function: Optional[Callable[[str], int]] = len,
):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.length_function = length_function


class LoaderConfig(BaseConfig):
"""
Config for the loader used in `add` method
"""

def __init__(self):
pass


class AddConfig(BaseConfig):
"""
Config for the `add` method.
"""
def __init__(self):
pass

def __init__(
self,
chunker: Optional[ChunkerConfig] = None,
loader: Optional[LoaderConfig] = None,
):
self.loader = loader
self.chunker = chunker
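With the classes added above, `AddConfig` now nests an optional `ChunkerConfig` and `LoaderConfig`. A standalone sketch of how they compose, using plain-Python stand-ins mirroring the diff (not importing embedchain):

```python
from typing import Callable, Optional

# Stand-in mirroring ChunkerConfig from the diff above.
class ChunkerConfig:
    def __init__(self,
                 chunk_size: Optional[int] = 4000,
                 chunk_overlap: Optional[int] = 200,
                 length_function: Optional[Callable[[str], int]] = len):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.length_function = length_function

# Stand-in mirroring AddConfig from the diff above.
class AddConfig:
    def __init__(self,
                 chunker: Optional[ChunkerConfig] = None,
                 loader=None):
        self.chunker = chunker
        self.loader = loader

# A caller can now tune chunking per `add` call:
config = AddConfig(chunker=ChunkerConfig(chunk_size=1000, chunk_overlap=100))
```

When `chunker` is left as `None`, each chunker falls back to its per-`data_type` `TEXT_SPLITTER_CHUNK_PARAMS` defaults, as shown in the chunker diffs.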
1 change: 1 addition & 0 deletions embedchain/config/BaseConfig.py
@@ -2,6 +2,7 @@ class BaseConfig:
"""
Base config.
"""

def __init__(self):
pass

