unable to handle new lines with proposed fix #235

Closed
progressEdd opened this issue Oct 11, 2023 · 4 comments

Comments

@progressEdd

When a string with multiple line breaks is passed into a chain, kor is unable to parse the raw text into the schema.
Solution: adding delimiters such as ``` around the text improves extraction. You can see this in the example at the end.

pip show kor
Name: kor
Version: 1.0.0

For example:

test_lots_of_new_lines = """ long extra text a the start of the document. value that will get extracted: name: John Smith





                                            probably will be missed main street: 123 main st




"""

will return:

{'data': {'customer_info': [{'MainStreet': '',
    'MainZipCode': '',
    'MainCity': '',
    'MainState': 'John',
    'CustomerFirstName': 'Smith',
    'CustomerLastName': '',
    'CustomerEmail': '',
    'CustomerPhone': '',
    'State': ''}]},
 'raw': 'MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State\n"123 main st"||||John|Smith||||\n',
 'errors': [],
 'validated_data': {}}

Only the name is returned in the dictionary, but the street address does show up in the raw output.

For further debugging, here's the output of chain.prompt.format_prompt(text=test_lots_of_new_lines).to_string(). To work with GitHub formatting, I removed the code block from the prompt.

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

TypeScript

customer_info: Array<{ // Customer information from a narrative.
 MainStreet: string // street address
 MainZipCode: string // zip code, usually 5 digits
 MainCity: string // US city of the business or customer
 MainState: string // US state of the business or customer
 CustomerFirstName: string // first name of the customer
 CustomerLastName: string // last name of the customer
 CustomerEmail: string // email address of the customer
 CustomerPhone: string // business phone of the customer
 State: string // service state
}>



Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: MainStreet
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
"Contact Address 1 & Contact Address 2 
 123 Industrial Rd, 5678 E Pine St"||||||||

Input: MainZipCode
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
|"Contact Zip 
 12345"|||||||

Input: MainCity
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
||"Contact City 
 Reno"||||||

Input: MainState
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
|||"Contact State 
 California"|||||

Input: CustomerFirstName
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
||||"Contact First 
 John"||||

Input: CustomerLastName
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
|||||"Contact Last 
 Smith"|||

Input: CustomerEmail
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
||||||"Contact Email 
 [email protected]"||

Input: CustomerPhone
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
|||||||"Contact Business Phone 
 123-123-1234"|

Input: State
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
||||||||"Service State 
 Florida"

Input:  long extra text a the start of the document. value that will get extracted: name: John Smith





                                            probably will be missed main steet: 123 main st





Output:

When I add delimiters, MainStreet is now extracted:
chain.run(text="```"+test_lots_of_new_lines+"```")

{'data': {'customer_info': [{'MainStreet': '123 main st',
    'MainZipCode': '',
    'MainCity': '',
    'MainState': '',
    'CustomerFirstName': 'John',
    'CustomerLastName': 'Smith',
    'CustomerEmail': '',
    'CustomerPhone': '',
    'State': ''}]},
 'raw': 'MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State\r\n"123 main st"||||John|Smith|||\r\n',
 'errors': [],
 'validated_data': {}}
@eyurtsev
Owner

Hi @progressEdd, thanks for the issue report. create_extraction_chain allows you to pass in an input formatter.

Here's the API reference: https://eyurtsev.github.io/kor/generated/kor.html#kor.create_extraction_chain

You can do something like this:

# For CSV encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="csv", input_formatter="triple_quotes")

# For JSON encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="json",
                                input_formatter="triple_quotes")

or pass in a callable to apply whatever formatting you want.

Another trick is to collapse a lot of contiguous whitespace -- it'll improve the results and reduce the token count.
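
For example, here's a rough sketch that combines both ideas by passing a callable formatter that collapses blank lines and then wraps the text in triple backticks. The callable signature (raw text in, formatted string out) is an assumption here, so double-check it against the API reference above; llm and node are the same objects as in the snippet above.

import re

from kor import create_extraction_chain


def collapse_and_quote(text: str) -> str:
    # Collapse runs of blank lines, then wrap the input in triple backticks
    # so it is clearly delimited inside the prompt.
    text = re.sub(r"\n{2,}", "\n", text)
    return f"```\n{text}\n```"


chain = create_extraction_chain(
    llm, node, encoder_or_encoder_class="csv", input_formatter=collapse_and_quote
)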

@progressEdd
Author

progressEdd commented Oct 12, 2023

Gotcha, I must have missed it. I was looking at the example code pages and didn't look too in-depth into the module page.

I'll have to look into input_formatter because that's cleaner than the string concatenation. Does the CSV encoder offer advantages (in token usage or LLM performance) over passing the data as plain text and trimming whitespace?

Another trick is to collapse a lot of contiguous whitespace -- it'll improve the results and reduce the token count.
Are there any functions that can do that? I was using this function I adapted:

import re

from icecream import ic  # ic() prints labelled debug output


def count_metrics(text):
    ic(len(text))
    # Remove excessive newlines and runs of consecutive whitespace
    text = re.sub(r"\n{2,}", "\n", text)
    text = re.sub(r"\s{2,}", " ", text)

    # Calculate the number of words, characters, and tokens
    word_count = len(text.split())
    char_count = len(text)
    token_count = len(text.split())  # rough estimate: assumes each word is one token

    ic(word_count, char_count, token_count)
    return text

I got it from
https://github.com/ajitdash/pview/blob/05a27d739d0dbae44e212975f400fc4a58b99937/Token%20Optimization%20-Tool/toklimit4f.py#L46

which was referenced by
https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/unlocking-the-power-of-tokens-optimizing-token-usage-in-gpt-for/ba-p/3826665
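
Plugging that cleanup into the chain from earlier would look like this:

# Collapse the whitespace first, then run extraction on the cleaned text.
cleaned = count_metrics(test_lots_of_new_lines)
result = chain.run(text=cleaned)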

@eyurtsev
Owner

Does the CSV encoder offer advantages (in token usage or LLM performance) over passing the data as plain text and trimming whitespace?

The encoders are used to encode the desired output from the model rather than the input (e.g., plain text).

It tells the LLM how to structure its output so that the output can be parsed into a structured representation.

JSON is more flexible and supports more complex structured data representations, but it also uses more tokens.
CSV is less flexible, but potentially more reliable, and it uses fewer tokens.

To get a sense of what's going on, you can follow the tutorial, try out both encoders, and print out the prompt that goes to the LLM:
https://eyurtsev.github.io/kor/objects.html#what-s-the-actual-prompt
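
For example, a minimal sketch (reusing the llm and node objects from the earlier snippet) that prints the prompt each encoder produces for the same schema:

from kor import create_extraction_chain

for encoder in ("csv", "json"):
    chain = create_extraction_chain(llm, node, encoder_or_encoder_class=encoder)
    print(f"--- {encoder} prompt ---")
    # Same call used earlier in this issue to inspect the rendered prompt.
    print(chain.prompt.format_prompt(text="[your text]").to_string())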

@progressEdd
Author

Gotcha, that makes sense. I thought the CSV encoder had the capability of truncating/processing the input text. I'll rely on the code I shared earlier.
