
kor.exceptions.ParseError(pandas.errors.ParserError('Error tokenizing data. C error: Expected 1 fields in line 4, saw 3\n #283

Closed
dantepalacio opened this issue Apr 18, 2024 · 0 comments


dantepalacio commented Apr 18, 2024

Hi, I want to use kor with the open-source OpenChat model (https://huggingface.co/openchat/openchat-3.5-0106). I know this model expects specific role prefixes and an end-of-turn suffix in its prompt. Here is an example:
GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
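
For reference, this format can also be produced from the model's own chat template in transformers instead of writing the markers by hand; a minimal sketch, assuming the openchat tokenizer ships a chat template:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openchat/openchat-3.5-0106")

# Let the tokenizer render the "GPT4 Correct ..." / <|end_of_turn|> markers
# from a plain list of chat messages.
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you today?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)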

My current code:

import torch

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline, BitsAndBytesConfig

# 8-bit quantization config, applied when the pipeline loads the model
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "openchat/openchat-3.5-0106"

tokenizer = AutoTokenizer.from_pretrained(model_id)

generation_pipeline = pipeline(
    "text-generation",  # task
    model=model_id,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    model_kwargs={"quantization_config": quantization_config},
    max_length=1000,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
hf = HuggingFacePipeline(pipeline=generation_pipeline, model_kwargs={"temperature": 0})

from langchain.prompts import PromptTemplate

INSTRUCTION_TEMPLATE = PromptTemplate(
    input_variables=["type_description", "format_instructions"],
    template='''GPT4 Correct System:Your goal is to extract structured information from the user's input that
matches the form described below. When extracting information please make
sure it matches the type information exactly. Do not add any attributes that
do not appear in the schema shown below.<|end_of_turn|>\n\n
GPT4 Correct User:
{type_description}\n\n
{format_instructions}<|end_of_turn|>\n\n
GPT4 Correct Assistant:''')


from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number


schema = Object(
    id="person",
    description="Personal information",
    examples=[
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ],
    attributes=[
        Text(
            id="first_name",
            description="The first name of a person.",
        )
    ],
    many=True,
)

chain = create_extraction_chain(hf, schema, instruction_template=INSTRUCTION_TEMPLATE)
chain.run(("My name is Bobby. My brother's name Joe."))


When I insert these suffixes into the prompt and run the chain, generation completes, but the output contains a parse error:

{'data': {},
 'raw': "GPT4 Correct System:Your goal is to extract structured information from the user's input that\nmatches the form described below. When extracting information please make\nsure it matches the type information exactly. Do not add any attributes that\ndo not appear in the schema shown below.<|end_of_turn|>\n\n\nGPT4 Correct User:\n```TypeScript\n\nperson: Array<{ // Personal information\n first_name: string // The first name of a person.\n}>\n```\n\n\n\nPlease output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. \n Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.<|end_of_turn|>\n\n\nGPT4 Correct Assistant:\n\nInput: Alice and Bob are friends\nOutput: first_name\nAlice\nBob\n\nInput: My name is Bobby. My brother's name Joe.\nOutput: first_name\nBobby\nJoe",
 'errors': [kor.exceptions.ParseError(pandas.errors.ParserError('Error tokenizing data. C error: Expected 1 fields in line 4, saw 3\n'))],
 'validated_data': {}}
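
As far as I can tell, the parser is tripping over the echoed prompt rather than over the extracted rows: the format instructions ask for |-delimited CSV, the error type suggests the raw text is handed to pandas, and the <|end_of_turn|> markers themselves contain | characters. A minimal sketch of that assumption, reproducing the same error outside of kor:

import io
import pandas as pd

# First lines of the echoed prompt: line 1 has no "|" (1 field), but line 4
# contains "<|end_of_turn|>", which splits into 3 fields on the "|" delimiter.
echoed = (
    "GPT4 Correct System:Your goal is to extract structured information from the user's input that\n"
    "matches the form described below. When extracting information please make\n"
    "sure it matches the type information exactly. Do not add any attributes that\n"
    "do not appear in the schema shown below.<|end_of_turn|>\n"
)
pd.read_csv(io.StringIO(echoed), sep="|")
# pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 3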

I realize this is because of the suffixes, but how can I avoid it? What do I need to rewrite?
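
One direction that might help, if the underlying text-generation pipeline respects it here, is to stop the pipeline from echoing the prompt back, so the completion kor has to parse no longer contains the "GPT4 Correct ..." prefix or the <|end_of_turn|> markers. A sketch, not verified with kor:

# Return only the newly generated text instead of prompt + completion
# (return_full_text is a standard TextGenerationPipeline parameter).
generation_pipeline = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=torch.float16,
    max_length=1000,
    eos_token_id=tokenizer.eos_token_id,
    return_full_text=False,
)
hf = HuggingFacePipeline(pipeline=generation_pipeline, model_kwargs={"temperature": 0})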

Also, if I don't specify the instruction_template parameter in create_extraction_chain, running the chain can take 15-20 minutes and the result is complete nonsense.

Any help would be appreciated.

dantepalacio reopened this Apr 18, 2024
dantepalacio changed the title from "KeyError: 'generated_text'" to "kor.exceptions.ParseError(pandas.errors.ParserError('Error tokenizing data. C error: Expected 1 fields in line 4, saw 3\n" on Apr 18, 2024