
kor.exceptions.ParseError(pandas.errors.ParserError('Error tokenizing data. C error: Expected 1 fields in line 4, saw 3\n #283

Closed
dantepalacio opened this issue Apr 18, 2024 · 0 comments


dantepalacio commented Apr 18, 2024

Hi, I want to use kor with the open-source OpenChat model (https://huggingface.co/openchat/openchat-3.5-0106). I know this model expects specific role prefixes and an end-of-turn suffix in its prompt. Here is an example:
GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
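
For reference, this format can also be produced from the model's own chat template in transformers instead of writing the markers by hand; a minimal sketch, assuming the openchat tokenizer ships a chat template:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openchat/openchat-3.5-0106")

# Let the tokenizer render the "GPT4 Correct ..." / <|end_of_turn|> markers
# from a plain list of chat messages.
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you today?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)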

My current code:

import torch

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline, BitsAndBytesConfig

# 8-bit quantization config, applied when the pipeline loads the model
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "openchat/openchat-3.5-0106"

tokenizer = AutoTokenizer.from_pretrained(model_id)

generation_pipeline = pipeline(
    "text-generation",  # task
    model=model_id,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    model_kwargs={"quantization_config": quantization_config},
    max_length=1000,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
hf = HuggingFacePipeline(pipeline=generation_pipeline, model_kwargs={"temperature": 0})

from langchain.prompts import PromptTemplate

INSTRUCTION_TEMPLATE = PromptTemplate(
    input_variables=["type_description", "format_instructions"],
    template='''GPT4 Correct System:Your goal is to extract structured information from the user's input that
matches the form described below. When extracting information please make
sure it matches the type information exactly. Do not add any attributes that
do not appear in the schema shown below.<|end_of_turn|>\n\n
GPT4 Correct User:
{type_description}\n\n
{format_instructions}<|end_of_turn|>\n\n
GPT4 Correct Assistant:''')


from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number


schema = Object(
    id="person",
    description="Personal information",
    examples=[
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ],
    attributes=[
        Text(
            id="first_name",
            description="The first name of a person.",
        )
    ],
    many=True,
)

chain = create_extraction_chain(hf, schema, instruction_template=INSTRUCTION_TEMPLATE)
chain.run(("My name is Bobby. My brother's name Joe."))


When I insert these suffixes into the prompt and run the chain, generation completes, but the output contains a parse error:

{'data': {},
 'raw': "GPT4 Correct System:Your goal is to extract structured information from the user's input that\nmatches the form described below. When extracting information please make\nsure it matches the type information exactly. Do not add any attributes that\ndo not appear in the schema shown below.<|end_of_turn|>\n\n\nGPT4 Correct User:\n```TypeScript\n\nperson: Array<{ // Personal information\n first_name: string // The first name of a person.\n}>\n```\n\n\n\nPlease output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. \n Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.<|end_of_turn|>\n\n\nGPT4 Correct Assistant:\n\nInput: Alice and Bob are friends\nOutput: first_name\nAlice\nBob\n\nInput: My name is Bobby. My brother's name Joe.\nOutput: first_name\nBobby\nJoe",
 'errors': [kor.exceptions.ParseError(pandas.errors.ParserError('Error tokenizing data. C error: Expected 1 fields in line 4, saw 3\n'))],
 'validated_data': {}}
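
As far as I can tell, the parser is tripping over the echoed prompt rather than over the extracted rows: the format instructions ask for |-delimited CSV, the error type suggests the raw text is handed to pandas, and the <|end_of_turn|> markers themselves contain | characters. A minimal sketch of that assumption, reproducing the same error outside of kor:

import io
import pandas as pd

# First lines of the echoed prompt: line 1 has no "|" (1 field), but line 4
# contains "<|end_of_turn|>", which splits into 3 fields on the "|" delimiter.
echoed = (
    "GPT4 Correct System:Your goal is to extract structured information from the user's input that\n"
    "matches the form described below. When extracting information please make\n"
    "sure it matches the type information exactly. Do not add any attributes that\n"
    "do not appear in the schema shown below.<|end_of_turn|>\n"
)
pd.read_csv(io.StringIO(echoed), sep="|")
# pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 3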

I realize this is because of the suffixes, but how can I avoid it? What do I need to rewrite?
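
One direction that might help, if the underlying text-generation pipeline respects it here, is to stop the pipeline from echoing the prompt back, so the completion kor has to parse no longer contains the "GPT4 Correct ..." prefix or the <|end_of_turn|> markers. A sketch, not verified with kor:

# Return only the newly generated text instead of prompt + completion
# (return_full_text is a standard TextGenerationPipeline parameter).
generation_pipeline = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=torch.float16,
    max_length=1000,
    eos_token_id=tokenizer.eos_token_id,
    return_full_text=False,
)
hf = HuggingFacePipeline(pipeline=generation_pipeline, model_kwargs={"temperature": 0})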

Also, if I don't specify the instruction_template parameter in create_extraction_chain, running the chain can take 15-20 minutes and the result is complete nonsense.

Any help would be appreciated.

dantepalacio reopened this Apr 18, 2024
dantepalacio changed the title from "KeyError: 'generated_text'" to "kor.exceptions.ParseError(pandas.errors.ParserError('Error tokenizing data. C error: Expected 1 fields in line 4, saw 3\n" on Apr 18, 2024