__init__() got an unexpected keyword argument 'chunker' while using customized chunker #265

Closed
YunchengLiang opened this issue Jul 14, 2023 · 5 comments

Comments

@YunchengLiang

image

I also tried the demo in the README, and the same error appears. Could you please help me?

@cachho
Contributor

cachho commented Jul 14, 2023

Can you try with the first level of the dict removed, so just

pdf_add_config = {
    "chunk_size": 200,
    ...

Linking #251

@YunchengLiang
Author

image
I tried that, but it gives the following error ...

@cachho
Contributor

cachho commented Jul 14, 2023

I think you need to say AddConfig(chunker=**pdf_add_config). I made a PR #270 that should simplify this issue.

@YunchengLiang
Author

chunker=**pdf_add_config

image
It says this is not valid syntax, and if I remove the **, it reports the unexpected keyword argument 'chunker' again...
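
For anyone hitting the same syntax error: in Python, `**` can only unpack a dict directly into a call's keyword arguments; it cannot follow `name=`. Below is a minimal sketch of the difference. `ChunkerSettings` is a hypothetical stand-in class written for this illustration, not embedchain's real `ChunkerConfig` or `AddConfig`.

```python
# Hypothetical stand-in class for illustration only; not part of embedchain.
class ChunkerSettings:
    def __init__(self, chunk_size=4000, chunk_overlap=200, length_function=len):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.length_function = length_function


pdf_add_config = {"chunk_size": 200, "chunk_overlap": 20, "length_function": len}

# Valid: ** expands the dict into keyword arguments at the call site.
settings = ChunkerSettings(**pdf_add_config)


def is_valid_syntax(source):
    """Return True if the expression compiles, False on a SyntaxError."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# Names are not resolved at compile time, so f and d need not exist here.
unpack_ok = is_valid_syntax("f(**d)")                # plain unpacking
keyword_star_ok = is_valid_syntax("f(chunker=**d)")  # ** after a keyword name

print(settings.chunk_size, unpack_ok, keyword_star_ok)
```

So the dict has to be unpacked into whatever constructor accepts `chunk_size` and the other keys, and the resulting object passed on from there.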

@YunchengLiang
Author

YunchengLiang commented Jul 15, 2023

My colleague and I found a temporary workaround using the code below:

from typing import Callable, Optional
from embedchain.config.BaseConfig import BaseConfig

class ChunkerConfig(BaseConfig):
    def __init__(
        self,
        chunk_size: Optional[int] = 4000,
        chunk_overlap: Optional[int] = 200,
        length_function: Optional[Callable[[str], int]] = len,
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.length_function = length_function


from embedchain.config.AddConfig import *

chunker = {
    "chunk_size": 200,
    "chunk_overlap": 20,
    "length_function": len,
}

cc = ChunkerConfig(**chunker)
ac = AddConfig()
ac.chunker = cc
print(ac.chunker.chunk_size)

pdf_url = 'https://www.rogers.com/cms/pdf/en/Consumer_SUG_V20.pdf'  # online resource
chat_bot.add('pdf_file', pdf_url, config=ac)

This seems to work, but it generates the same number of chunks (55 in total for this file) no matter which chunk_size we configure.
We did remember to clear the db by deleting everything under it each time before changing the chunk_size parameter, but it still outputs the same number of chunks...
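
As a point of comparison, here is a rough sketch of why the chunk count would normally be expected to change with chunk_size. `split_fixed` is a hypothetical fixed-width splitter written for this illustration; it is not the splitter embedchain uses, and the sample text is synthetic.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Split text into windows of chunk_size characters, each overlapping
    the previous one by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


sample = "x" * 10_000  # synthetic stand-in for extracted PDF text

# Number of chunks produced for several chunk sizes, with a fixed overlap of 20.
counts = {size: len(split_fixed(sample, size, 20)) for size in (200, 1000, 4000)}
print(counts)
```

With a splitter like this, a smaller chunk_size yields more chunks, so an identical count across settings suggests the configured value may not be reaching the chunker at all.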
